Thread: Why we lost Uber as a user
Hello,

The following article is a very good look at some of our limitations and highlights some of the pains many of us have been working "around" since we started using the software.

https://eng.uber.com/mysql-migration/

Specifically:

* Inefficient architecture for writes
* Inefficient data replication
* Issues with table corruption
* Poor replica MVCC support
* Difficulty upgrading to newer releases

It is a very good read and I encourage our hackers to do so with an open mind.

Sincerely,

JD

--
Command Prompt, Inc.  http://the.postgres.company/  +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.
On 07/26/2016 09:54 AM, Joshua D. Drake wrote:
> Hello,
>
> The following article is a very good look at some of our limitations and
> highlights some of the pains many of us have been working "around" since
> we started using the software.

They also had other reasons to switch to MySQL, particularly around changes of staffing (the switch happened after they got a new CTO). And they encountered that 9.2 bug literally the week we released a fix, per one of the mailing lists. Even if they switched off, it's still a nice testimonial that they once ran their entire worldwide fleet off a single Postgres cluster.

However, the issues they cite as limitations of our current replication system are real, or we wouldn't have so many people working on alternatives. We could really use pglogical in 10.0, as well as OLTP-friendly MM replication.

The write amplification issue, and its corollary in VACUUM, certainly continues to plague some users, and doesn't have any easy solutions.

I do find it interesting that they mention schema changes in passing, without actually saying anything about them -- given that schema changes have been one of MySQL's major limitations. I'll also note that they don't mention any of MySQL's corresponding weak spots, such as limitations on table size due to primary key sorting. One wonders what would have happened if they'd adopted a sharding model on top of Postgres?

I would like to see someone blog about our testing for replication corruption issues now, in response to this.

--
-- Josh Berkus
Red Hat OSAS
(any opinions are my own)
On 07/26/2016 01:53 PM, Josh Berkus wrote: > The write amplification issue, and its correllary in VACUUM, certainly > continues to plague some users, and doesn't have any easy solutions. To explain this in concrete terms, which the blog post does not: 1. Create a small table, but one with enough rows that indexes make sense (say 50,000 rows). 2. Make this table used in JOINs all over your database. 3. To support these JOINs, index most of the columns in the small table. 4. Now, update that small table 500 times per second. That's a recipe for runaway table bloat; VACUUM can't do much because there's always some minutes-old transaction hanging around (and SNAPSHOT TOO OLD doesn't really help, we're talking about minutes here), and because of all of the indexes HOT isn't effective. Removing the indexes is equally painful because it means less efficient JOINs. The Uber guy is right that InnoDB handles this better as long as you don't touch the primary key (primary key updates in InnoDB are really bad). This is a common problem case we don't have an answer for yet. -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
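A minimal SQL sketch of the scenario described above (table name, columns, and row counts are purely illustrative):

-- A heavily-referenced "small" lookup table with most columns indexed.
CREATE TABLE hot_lookup (
    id    int PRIMARY KEY,
    col1  int,
    col2  int,
    col3  text,
    col4  timestamptz
);
CREATE INDEX ON hot_lookup (col1);
CREATE INDEX ON hot_lookup (col2);
CREATE INDEX ON hot_lookup (col3);
CREATE INDEX ON hot_lookup (col4);

INSERT INTO hot_lookup
SELECT i, i % 100, i % 1000, md5(i::text), now()
FROM generate_series(1, 50000) AS i;

-- Now imagine ~500 of these per second from many sessions, while other
-- transactions stay open for a few minutes.  Because an indexed column
-- changes, the update is not HOT: every index gets a new entry, and the
-- dead row versions cannot be vacuumed until the old snapshots go away.
UPDATE hot_lookup SET col4 = now(), col1 = col1 + 1 WHERE id = 42;

Watching n_tup_hot_upd versus n_tup_upd in pg_stat_user_tables for such a table shows how rarely HOT applies once the extra indexes are in place.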
On Tue, Jul 26, 2016 at 02:26:57PM -0700, Josh Berkus wrote: > On 07/26/2016 01:53 PM, Josh Berkus wrote: > > The write amplification issue, and its correllary in VACUUM, certainly > > continues to plague some users, and doesn't have any easy solutions. > > To explain this in concrete terms, which the blog post does not: > > 1. Create a small table, but one with enough rows that indexes make > sense (say 50,000 rows). > > 2. Make this table used in JOINs all over your database. > > 3. To support these JOINs, index most of the columns in the small table. > > 4. Now, update that small table 500 times per second. > > That's a recipe for runaway table bloat; VACUUM can't do much because > there's always some minutes-old transaction hanging around (and SNAPSHOT > TOO OLD doesn't really help, we're talking about minutes here), and > because of all of the indexes HOT isn't effective. Removing the indexes > is equally painful because it means less efficient JOINs. > > The Uber guy is right that InnoDB handles this better as long as you > don't touch the primary key (primary key updates in InnoDB are really bad). > > This is a common problem case we don't have an answer for yet. Or, basically, we don't have an answer to without making something else worse. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Tue, Jul 26, 2016 at 5:26 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 07/26/2016 01:53 PM, Josh Berkus wrote: >> The write amplification issue, and its correllary in VACUUM, certainly >> continues to plague some users, and doesn't have any easy solutions. > > To explain this in concrete terms, which the blog post does not: > > 1. Create a small table, but one with enough rows that indexes make > sense (say 50,000 rows). > > 2. Make this table used in JOINs all over your database. > > 3. To support these JOINs, index most of the columns in the small table. > > 4. Now, update that small table 500 times per second. > > That's a recipe for runaway table bloat; VACUUM can't do much because > there's always some minutes-old transaction hanging around (and SNAPSHOT > TOO OLD doesn't really help, we're talking about minutes here), and > because of all of the indexes HOT isn't effective. Removing the indexes > is equally painful because it means less efficient JOINs. > > The Uber guy is right that InnoDB handles this better as long as you > don't touch the primary key (primary key updates in InnoDB are really bad). > > This is a common problem case we don't have an answer for yet. This is why I think we need a pluggable heap storage layer, which could be done either by rebranding foreign data wrappers as data wrappers (as I have previously proposed) or using the access method interface (as proposed by Alexander Korotkov) at PGCon. We're reaching the limits of what can be done using our current heap format, and we need to enable developers to experiment with new things. Aside from the possibility of eventually coming up with something that's good enough to completely (or mostly) replace our current heap storage format, we need to support specialized data storage formats that are optimized for particular use cases (columnar, memory-optimized, WORM). I know that people are worried about ending up with too many heap storage formats, but I think we should be a lot more worried about not having enough heap storage formats. Anybody who thinks that the current design is working for all of our users is not paying very close attention. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
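For context, the index-AM side of this already exists in 9.6 (CREATE ACCESS METHOD), and a pluggable heap layer would presumably expose a similar registration hook for table storage. A minimal sketch of the existing index-side syntax; the handler shown is simply the built-in btree handler, so this is only about the shape of the interface:

-- 9.6 lets an extension register a new index access method; the handler
-- function supplies the implementation.  bthandler is reused here only to
-- illustrate the interface (a real extension would install its own handler
-- and its own operator classes before the AM is usable).
CREATE ACCESS METHOD demo_am TYPE INDEX HANDLER bthandler;

SELECT amname, amhandler FROM pg_am WHERE amname = 'demo_am';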
Josh Berkus <josh@agliodbs.com> writes: > To explain this in concrete terms, which the blog post does not: > 1. Create a small table, but one with enough rows that indexes make > sense (say 50,000 rows). > 2. Make this table used in JOINs all over your database. > 3. To support these JOINs, index most of the columns in the small table. > 4. Now, update that small table 500 times per second. > That's a recipe for runaway table bloat; VACUUM can't do much because > there's always some minutes-old transaction hanging around (and SNAPSHOT > TOO OLD doesn't really help, we're talking about minutes here), and > because of all of the indexes HOT isn't effective. Hm, I'm not following why this is a disaster. OK, you have circa 100% turnover of the table in the lifespan of the slower transactions, but I'd still expect vacuuming to be able to hold the bloat to some small integer multiple of the minimum possible table size. (And if the table is small, that's still small.) I suppose really long transactions (pg_dump?) could be pretty disastrous, but there are ways around that, like doing pg_dump on a slave. Or in short, this seems like an annoyance, not a time-for-a-new-database kind of problem. regards, tom lane
On 07/26/2016 03:07 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> That's a recipe for runaway table bloat; VACUUM can't do much because
>> there's always some minutes-old transaction hanging around (and SNAPSHOT
>> TOO OLD doesn't really help, we're talking about minutes here), and
>> because of all of the indexes HOT isn't effective.
>
> Hm, I'm not following why this is a disaster. OK, you have circa 100%
> turnover of the table in the lifespan of the slower transactions, but I'd
> still expect vacuuming to be able to hold the bloat to some small integer
> multiple of the minimum possible table size.

Not in practice. Don't forget that you also have bloat of the indexes as well. I encountered multiple cases of this particular failure, and often bloat ended up at something like 100X of the clean table/index size, with no stable size (that is, it always kept growing). This was the original impetus for wanting REINDEX CONCURRENTLY, but really that's kind of a workaround.

> (And if the table is small, that's still small.) I suppose really long
> transactions (pg_dump?) could be pretty disastrous, but there are ways
> around that, like doing pg_dump on a slave.

You'd need a dedicated slave for the pg_dump, otherwise you'd hit query cancel.

> Or in short, this seems like an annoyance, not a time-for-a-new-database
> kind of problem.

It's considerably more than an annoyance for the people who suffer from it; for some databases I dealt with, this one issue was responsible for 80% of administrative overhead (cron jobs, reindexing, timeouts ...). But no, it's not a database-switcher *by itself*. But it is a chronic, and serious, problem. I don't have even a suggestion of a real solution for it without breaking something else, though.

--
-- Josh Berkus
Red Hat OSAS
(any opinions are my own)
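For anyone trying to reproduce or monitor this, the standard statistics views and the pgstattuple contrib module make the effect visible; a rough sketch, reusing the hypothetical hot_lookup table from the earlier example:

-- How many updates were HOT, and how much dead space is accumulating?
SELECT relname, n_tup_upd, n_tup_hot_upd, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'hot_lookup';

-- Physical bloat of the table and one of its indexes (requires contrib).
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('hot_lookup');
SELECT * FROM pgstatindex('hot_lookup_pkey');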
On Tue, Jul 26, 2016 at 6:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >> To explain this in concrete terms, which the blog post does not: > >> 1. Create a small table, but one with enough rows that indexes make >> sense (say 50,000 rows). > >> 2. Make this table used in JOINs all over your database. > >> 3. To support these JOINs, index most of the columns in the small table. > >> 4. Now, update that small table 500 times per second. > >> That's a recipe for runaway table bloat; VACUUM can't do much because >> there's always some minutes-old transaction hanging around (and SNAPSHOT >> TOO OLD doesn't really help, we're talking about minutes here), and >> because of all of the indexes HOT isn't effective. > > Hm, I'm not following why this is a disaster. OK, you have circa 100% > turnover of the table in the lifespan of the slower transactions, but I'd > still expect vacuuming to be able to hold the bloat to some small integer > multiple of the minimum possible table size. (And if the table is small, > that's still small.) I suppose really long transactions (pg_dump?) could > be pretty disastrous, but there are ways around that, like doing pg_dump > on a slave. > > Or in short, this seems like an annoyance, not a time-for-a-new-database > kind of problem. I've seen multiple cases where this kind of thing causes a sufficiently large performance regression that the system just can't keep up. Things are OK when the table is freshly-loaded, but as soon as somebody runs a query on any table in the cluster that lasts for a minute or two, so much bloat accumulates that the performance drops to an unacceptable level. This kind of thing certainly doesn't happen to everybody, but equally certainly, this isn't the first time I've heard of it being a problem. Sometimes, with careful tending and a very aggressive autovacuum configuration, you can live with it, but it's never a lot of fun. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 27, 2016 at 7:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've seen multiple cases where this kind of thing causes a
> sufficiently large performance regression that the system just can't
> keep up. Things are OK when the table is freshly-loaded, but as soon
> as somebody runs a query on any table in the cluster that lasts for a
> minute or two, so much bloat accumulates that the performance drops to
> an unacceptable level. This kind of thing certainly doesn't happen to
> everybody, but equally certainly, this isn't the first time I've heard
> of it being a problem. Sometimes, with careful tending and a very
> aggressive autovacuum configuration, you can live with it, but it's
> never a lot of fun.

Yes, that's not fun at all. And it takes days of testing to get this tuning right on a product that, according to its spec, is supposed to just work and ease the customer experience.

As much as this post is interesting, the comments on HN are a good read as well: https://news.ycombinator.com/item?id=12166585

Some of the points raised there are that the "flaws" mentioned in this post are actually advantages. But I guess this depends on how you want to run your business via your application layer.
--
Michael
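The kind of per-table autovacuum tuning being alluded to usually ends up looking something like the following; the table name and the values are illustrative only, and what actually works is workload-dependent:

-- Vacuum the hot table as aggressively as the settings allow.
ALTER TABLE hot_lookup SET (
    autovacuum_vacuum_scale_factor = 0.0,
    autovacuum_vacuum_threshold    = 1000,
    autovacuum_vacuum_cost_delay   = 0,
    autovacuum_vacuum_cost_limit   = 10000
);
-- Even so, vacuum still cannot remove rows that some open snapshot can
-- still see, which is the crux of the problem described upthread.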
* Joshua D. Drake (jd@commandprompt.com) wrote:
> Hello,
>
> The following article is a very good look at some of our limitations
> and highlights some of the pains many of us have been working
> "around" since we started using the software.
>
> https://eng.uber.com/mysql-migration/
>
> Specifically:
>
> * Inefficient architecture for writes
> * Inefficient data replication

The above are related and there are serious downsides to having an extra mapping in the middle between the indexes and the heap.

What makes me doubt just how well they understood the issues or what is happening is the lack of any mention of hint bits or tuple freezing (requiring additional writes).

> * Issues with table corruption

That was a bug that was fixed quite quickly once it was detected. The implication that MySQL doesn't have similar bugs is entirely incorrect, as is the idea that logical replication would avoid data corruption issues (in practice, it actually tends to be quite a bit worse).

> * Poor replica MVCC support

Solved through the hot standby feedback system.

> * Difficulty upgrading to newer releases

Their specific issue with these upgrades was solved, years ago, by me (and it wasn't particularly difficult to do...) through the use of pg_upgrade's --link option and rsync's ability to construct hard link trees. Making major release upgrades easier with less downtime is certainly a good goal, but there's been a solution to the specific issue they had here for quite a while.

Thanks!

Stephen
On Tue, Jul 26, 2016 at 8:27 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Joshua D. Drake (jd@commandprompt.com) wrote: >> Hello, >> >> The following article is a very good look at some of our limitations >> and highlights some of the pains many of us have been working >> "around" since we started using the software. >> >> https://eng.uber.com/mysql-migration/ >> >> Specifically: >> >> * Inefficient architecture for writes >> * Inefficient data replication > > The above are related and there are serious downsides to having an extra > mapping in the middle between the indexes and the heap. > > What makes me doubt just how well they understood the issues or what is > happening is the lack of any mention of hint bits of tuple freezing > (requiring additional writes). Yeah. A surprising amount of that post seemed to be devoted to describing how our MVCC architecture works rather than what problem they had with it. I'm not saying we shouldn't take their bad experience seriously - we clearly should - but I don't feel like it's as clear as it could be about exactly where the breakdowns happened. That's why I found Josh's restatement useful - I am assuming without proof that his restatement is accurate.... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 27/07/16 05:45, Robert Haas wrote: > On Tue, Jul 26, 2016 at 8:27 PM, Stephen Frost <sfrost@snowman.net> wrote: >> * Joshua D. Drake (jd@commandprompt.com) wrote: >>> Hello, >>> >>> The following article is a very good look at some of our limitations >>> and highlights some of the pains many of us have been working >>> "around" since we started using the software. >>> >>> https://eng.uber.com/mysql-migration/ >>> >>> Specifically: >>> >>> * Inefficient architecture for writes >>> * Inefficient data replication >> >> The above are related and there are serious downsides to having an extra >> mapping in the middle between the indexes and the heap. >> >> What makes me doubt just how well they understood the issues or what is >> happening is the lack of any mention of hint bits of tuple freezing >> (requiring additional writes). > > Yeah. A surprising amount of that post seemed to be devoted to > describing how our MVCC architecture works rather than what problem > they had with it. I'm not saying we shouldn't take their bad > experience seriously - we clearly should - but I don't feel like it's > as clear as it could be about exactly where the breakdowns happened. There is some more detailed information in this 30-minute talk: https://vimeo.com/145842299 -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Tue, Jul 26, 2016 at 5:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >> To explain this in concrete terms, which the blog post does not: > >> 1. Create a small table, but one with enough rows that indexes make >> sense (say 50,000 rows). > >> 2. Make this table used in JOINs all over your database. > >> 3. To support these JOINs, index most of the columns in the small table. > >> 4. Now, update that small table 500 times per second. > >> That's a recipe for runaway table bloat; VACUUM can't do much because >> there's always some minutes-old transaction hanging around (and SNAPSHOT >> TOO OLD doesn't really help, we're talking about minutes here), and >> because of all of the indexes HOT isn't effective. > > Hm, I'm not following why this is a disaster. OK, you have circa 100% > turnover of the table in the lifespan of the slower transactions, but I'd > still expect vacuuming to be able to hold the bloat to some small integer > multiple of the minimum possible table size. (And if the table is small, > that's still small.) I suppose really long transactions (pg_dump?) could > be pretty disastrous, but there are ways around that, like doing pg_dump > on a slave. > > Or in short, this seems like an annoyance, not a time-for-a-new-database > kind of problem. Well, the real annoyance as I understand it is the raw volume of bytes of WAL traffic a single update of a field can cause. They switched to statement level replication(!). merlin
On Wed, Jul 27, 2016 at 08:33:52AM -0500, Merlin Moncure wrote: > > Or in short, this seems like an annoyance, not a time-for-a-new-database > > kind of problem. > > Well, the real annoyance as I understand it is the raw volume of bytes > of WAL traffic a single update of a field can cause. They switched to > statement level replication(!). Well, their big complaint about binary replication is that a bug can spread from a master to all slaves, which doesn't happen with statement level replication. If that type of corruption is your primary worry, and you can ignore the worries about statement level replication, then it makes sense. Of course, the big tragedy is that statement level replication has known unfixable(?) failures, while binary replication failures are caused by developer-introduced bugs. In some ways, people worry about the bugs they have seen, not the bugs they haven't seen. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 07/26/2016 08:45 PM, Robert Haas wrote: > That's why I found Josh's restatement useful - I am assuming without > proof that his restatement is accurate.... FWIW, my restatement was based on some other sites rather than Uber. Including folks who didn't abandon Postgres. -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
On 27 July 2016 at 17:04, Bruce Momjian <bruce@momjian.us> wrote:
> Well, their big complaint about binary replication is that a bug can
> spread from a master to all slaves, which doesn't happen with statement
> level replication.
I'm not sure that that makes sense to me. If there's a database bug that occurs when you run a statement on the master, it seems there's a decent chance that that same bug is going to occur when you run the same statement on the slave.

Obviously it depends on the type of bug and how identical the slave is, but statement-level replication certainly doesn't preclude such a bug from propagating.
Geoff
On 7/28/16, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> On 27 July 2016 at 17:04, Bruce Momjian <bruce@momjian.us> wrote:
>> Well, their big complaint about binary replication is that a bug can
>> spread from a master to all slaves, which doesn't happen with statement
>> level replication.
>
> I'm not sure that that makes sense to me. If there's a database bug that
> occurs when you run a statement on the master, it seems there's a decent
> chance that that same bug is going to occur when you run the same statement
> on the slave.
>
> Obviously it depends on the type of bug and how identical the slave is, but
> statement-level replication certainly doesn't preclude such a bug from
> propagating.
>
> Geoff

Please read the article first! The bug was about wrong visibility of tuples after applying WAL on slaves. For example, you could get two different records when selecting from a table by primary key (their PKs are the same, but the other columns differ).

From the article (emphasis mine):

    The following query illustrates how this bug would affect our users
    table example:
    SELECT * FROM users WHERE id = 4;
    This query would return *TWO* records: ...

And it affected the slaves, not the master. Slaves exist to reduce the load on the master; if you had to run all queries (even read-only ones) on the master, why would you have so many slaves?
--
Best regards,
Vitaly Burovoy
<p dir="ltr">On 28 Jul 2016 12:19, "Vitaly Burovoy" <<a href="mailto:vitaly.burovoy@gmail.com">vitaly.burovoy@gmail.com</a>>wrote:<br /> ><br /> > On 7/28/16, Geoff Winkless<<a href="mailto:pgsqladmin@geoff.dj">pgsqladmin@geoff.dj</a>> wrote:<br /> > > On 27 July 2016 at 17:04,Bruce Momjian <<a href="mailto:bruce@momjian.us">bruce@momjian.us</a>> wrote:<br /> > ><br /> > >>Well, their big complaint about binary replication is that a bug can<br /> > >> spread from a master toall slaves, which doesn't happen with statement<br /> > >> level replication.<br /> > ><br /> > ><br /> > > I'm not sure that that makes sense to me. If there's a database bug that<br /> > > occurs whenyou run a statement on the master, it seems there's a decent<br /> > > chance that that same bug is going to occurwhen you run the same statement<br /> > > on the slave.<br /> > ><br /> > > Obviously it depends onthe type of bug and how identical the slave is, but<br /> > > statement-level replication certainly doesn't precludesuch a bug from<br /> > > propagating.<br /> > ><br /> > > Geoff<br /> ><br /> > Please,read the article first! The bug is about wrong visibility of<br /> > tuples after applying WAL at slaves.<br />> For example, you can see two different records selecting from a table<br /> > by a primary key (moreover, theirPKs are the same, but other columns<br /> > differ).<p dir="ltr">I read the article. It affected slaves as well asthe master.<p dir="ltr">I quote:<br /> "because of the way replication works, this issue has the potential to spread intoall of the databases in a replication hierarchy"<p dir="ltr">I maintain that this is a nonsense argument. Especiallysince (as you pointed out and as I missed first time around) the bug actually occurred at different records ondifferent slaves, so he invalidates his own point.<p dir="ltr">Geoff
Statement-Based replication has a lot of problems with it like indeterminate UDFs. Here is a link to see them all: https://dev.mysql.com/doc/refman/5.7/en/replication-sbr-rbr.html#replication-sbr-rbr-sbr-disadvantages
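A trivial example of the indeterminism problem: the same statement can legitimately produce different results on master and replica, so it cannot simply be replayed verbatim (table and column names here are made up):

-- Any of these can differ between master and replica if replayed as SQL:
UPDATE sessions SET last_seen = now() WHERE token = 'abc';           -- clock skew
INSERT INTO audit_log (id, note) VALUES (nextval('audit_seq'), 'x'); -- sequence state
DELETE FROM queue WHERE id IN (SELECT id FROM queue LIMIT 10);       -- no ORDER BY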
On Thu, Jul 28, 2016 at 8:16 AM, pgwhatever <michael@sqlexec.com> wrote:
> Statement-Based replication has a lot of problems with it like indeterminate
> UDFs. Here is a link to see them all:
> https://dev.mysql.com/doc/refman/5.7/en/replication-sbr-rbr.html#replication-sbr-rbr-sbr-disadvantages

Sure. It's also incredibly efficient with respect to bandwidth -- so, if your application was engineered to work around those problems, it's a huge win. They could have used pgpool, but I guess the fix was already in.

Taking a step back, from the outside, it looks like uber:
*) has a very thick middleware, very thin database with respect to logic and complexity
*) has a very high priority on quick and cheap (in terms of bandwidth) replication
*) has decided the database needs to be interchangeable
*) is not afraid to make weak or erroneous technical justifications as a basis of stack selection (the futex vs ipc argument I felt was particularly awful -- it ignored the fact we use spinlocks)

The very fact that they swapped it out so easily suggests that they were not utilizing the database as they could have, and a different technical team might have come to a different result. Postgres is a very general system and rewards deep knowledge such that it can outperform even specialty systems in the hands of a capable developer (for example, myself). I'm just now hammering in the final coffin nails that will get solr swapped out for jsonb backed postgres.

I guess it's fair to say that they felt mysql is closer to what they felt a database should do out of the box. That's disappointing, but life moves on. The takeaways are:
*) people like different choices of replication mechanics -- statement level sucks a lot of the time, but not all the time
*) hs/sr simplicity of configuration and operation is a big issue. it's continually gotten better and still needs to
*) bad QC can cost you customers. how much regression coverage do we have of hs/sr?
*) postgres may not be the ideal choice for those who want a thin and simple database

merlin
>> That's a recipe for runaway table bloat; VACUUM can't do much because
>> there's always some minutes-old transaction hanging around (and SNAPSHOT
>> TOO OLD doesn't really help, we're talking about minutes here), and
>> because of all of the indexes HOT isn't effective.
Just curious: what if PostgreSQL supported an index that stores the "primary key" (or unique key) instead of tids?
Am I right that that kind of index would not suffer from that bloat? I'm assuming the primary key is not updated, so secondary indexes built that way should be much less prone to bloat when updates land on other columns (even if the tid moves, the PK does not change, so the secondary index row could be reused).
If that works, it could reduce index bloat and reduce the amount of WAL (fewer indexes would need to be updated). Of course it would make index scans a bit more expensive, but it looks like at least Uber is fine with that extra cost.
Does it make sense to implement that kind of index as an access method?
Vladimir
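Spelled out with the users example from the article: a secondary index today maps the indexed value straight to a heap ctid, while the proposed "indirect" index would map it to the primary key and go through the PK index. A purely conceptual sketch (no such index type exists; names are illustrative):

-- users(id PRIMARY KEY, email, ...), with a secondary index on email.
--
--   current btree:    email -> ctid              (needs a new entry whenever
--                                                 the row gets a new ctid)
--   proposed index:   email -> id -> ctid        (survives row movement as
--                                                 long as id is stable)
--
-- Either way the query looks the same; only the number of index probes
-- and the update-time index maintenance differ:
SELECT * FROM users WHERE email = 'someone@example.com';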
On 28.07.2016 17:53, Vladimir Sitnikov wrote:
>> That's a recipe for runaway table bloat; VACUUM can't do much because
>> there's always some minutes-old transaction hanging around (and SNAPSHOT
>> TOO OLD doesn't really help, we're talking about minutes here), and
>> because of all of the indexes HOT isn't effective.
>
> Just curious: what if PostgreSQL supported index that stores "primary
> key" (or unique key) instead of tids?
> Am I right that kind of index would not suffer from that bloat? I'm
> assuming the primary key is not updated, thus secondary indices build in
> that way should be much less prone to bloat when updates land to other
> columns (even if tid moves, its PK does not change, thus secondary index
> row could be reused).
> If that works, it could reduce index bloat, reduce the amount of WAL
> (less indices will need be updated). Of course it will make index scan a
> bit worse, however it looks like at least Uber is fine with that extra
> cost of index scan.
> Does it make sense to implement that kind of index as an access method?
> Vladimir
You mean IOT like Oracle have?
Alex Ignatov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 07/28/2016 03:58 AM, Geoff Winkless wrote: > On 27 July 2016 at 17:04, Bruce Momjian <bruce@momjian.us > <mailto:bruce@momjian.us>>wrote: > > Well, their big complaint about binary replication is that a bug can > spread from a master to all slaves, which doesn't happen with statement > level replication. > > > > I'm not sure that that makes sense to me. If there's a database bug > that occurs when you run a statement on the master, it seems there's a > decent chance that that same bug is going to occur when you run the same > statement on the slave. > > Obviously it depends on the type of bug and how identical the slave is, > but statement-level replication certainly doesn't preclude such a bug > from propagating. That's correct, which is why I ignored that part of their post. However, we did have issues for a couple of years where replication accuracy was poorly tested, and did have several bugs associated with that. It's not surprising that a few major users got hit hard by those bugs and decided to switch. -- -- Josh Berkus Red Hat OSAS (any opinions are my own)
On 7/28/16 10:05 AM, Alex Ignatov wrote:
>> Just curious: what if PostgreSQL supported index that stores "primary
>> key" (or unique key) instead of tids?
>
> You mean IOT like Oracle have?

IIRC, an IOT stores the table itself in index order, which is something different. What Vladimir is proposing is an index method that stores a datum instead of a ctid. You would then use that datum to probe a different index to get the ctid. Or put simply, you have a PK index that contains ctid's, and a bunch of other indexes that contain a PK value instead of ctid's.

I think it's an idea worth pursuing, but I don't see how you can make it work with our MVCC system unless we drop the aversion to scanning back into an index as part of an update.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
> On 7/28/16 10:05 AM, Alex Ignatov wrote:
> >> Just curious: what if PostgreSQL supported index that stores "primary
> >> key" (or unique key) instead of tids?
> >
> > You mean IOT like Oracle have?
>
> IIRC, IOT either stores the table in index order, which is something
> different.

IOT is definitely an interesting idea that I'd like to see us pursue, but I agree that it's something different.

> What Alex is proposing is an index method that stores a datum
> instead of a ctid. You would then use that datum to probe a
> different index to get the ctid. Or put simply, you have a PK index
> that contains ctid's, and a bunch of other indexes that contain a PK
> value instead of ctid's.

Right, that's the MySQL approach, which has advantages and disadvantages.

> I think it's an idea worth pursuing, but I don't see how you can
> make it work with our MVCC system unless we drop the aversion to
> scanning back into an index as part of an update.

I'm not terribly excited about the MySQL approach, personally, but I really like the idea of trying to make HOT updates smarter and allow HOT updates for indexes which don't include TIDs, as Robert and Alvaro are discussing.

Another thought that was kicking around in my head related to that is if we could have indexes that only provide page-level information (similar to BRIN, but maybe a btree) and which also would allow HOT updates. Those indexes would typically be used in a bitmap index scan where we're going to be doing a bitmap heap scan with a recheck, of course, though I wonder if we could come up with a way to do an in-order bitmap index scan where we sort the tuples on the page and then perform some kind of mergejoin recheck (or just pull out whatever the lowest-not-seen is each time we sort the tuples on the page).

All very hand-wavy, of course, and it'd make sense to make the concept work for BRIN before we consider anything else, but it seems like there could be a use-case for allowing indexes other than BRIN to be built in a way that allows HOT updates to happen, thus eliminating the cost of having to update those indexes when the tuple is changed, in many cases. Of course, those indexes couldn't be used as UNIQUE indexes or used for primary keys, and by adjusting the parameters of a BRIN index you could possibly get a similar index, but this might allow such an index to still be usable for index-only scans, which a BRIN index will never be able to provide.

Thanks!

Stephen
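BRIN is the closest existing thing to the page-level index described here; with pages_per_range = 1 it is effectively a page-granularity index, usable through bitmap scans with a recheck, although at the time of this thread an update to a BRIN-indexed column still disqualifies a tuple from HOT. A rough sketch, reusing the hypothetical hot_lookup table from earlier:

-- A BRIN index summarizes ranges of pages instead of pointing at tuples.
CREATE INDEX hot_lookup_col2_brin ON hot_lookup
    USING brin (col2) WITH (pages_per_range = 1);

SET enable_seqscan = off;   -- nudge the planner toward the bitmap plan on a toy table
EXPLAIN SELECT * FROM hot_lookup WHERE col2 = 123;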
On Fri, Jul 29, 2016 at 10:44:29AM -0400, Stephen Frost wrote: > Another thought that was kicking around in my head related to that is if > we could have indexes that only provide page-level information (similar > to BRIN, but maybe a btree) and which also would allow HOT updates. > Those indexes would typically be used in a bitmap index scan where we're > going to be doing a bitmap heap scan with a recheck, of course, though I > wonder if we could come up with a way to do an in-order bitmap index > scan where we sort the tuples on the page and then perform some kind of > mergejoin recheck (or just pull out whatever the lowest-not-seen each > time we sort the tuples on the page). So allow HOT updates if the updated row is on the same page, even if the indexed column was changed, by scanning the page --- got it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Jul 29, 2016 at 10:44:29AM -0400, Stephen Frost wrote:
> > Another thought that was kicking around in my head related to that is if
> > we could have indexes that only provide page-level information (similar
> > to BRIN, but maybe a btree) and which also would allow HOT updates.
> > Those indexes would typically be used in a bitmap index scan where we're
> > going to be doing a bitmap heap scan with a recheck, of course, though I
> > wonder if we could come up with a way to do an in-order bitmap index
> > scan where we sort the tuples on the page and then perform some kind of
> > mergejoin recheck (or just pull out whatever the lowest-not-seen each
> > time we sort the tuples on the page).
>
> So allow HOT updates if the updated row is on the same page, even if the
> indexed column was changed, by scanning the page --- got it.

The idea I had was to allow creation of indexes which *only* include the page ID. Your rephrase seems to imply that we'd have a regular index but then be able to figure out if a given tuple had any HOT updates performed on it and, if so, scan the entire page. As I understand it, it's more complicated than that because we must involve an index when updating a tuple in some cases (UNIQUE?) and therefore we don't perform HOT in the case where any indexed column is being changed.

Of course, this only works if these page-level indexes don't support the features that prevent HOT updates today. If we can tell which existing indexes have been built in such a way as to prevent HOT updates and which would work with a HOT-updated tuple, then perhaps we could change the HOT code to check that when it's considering whether a tuple can be updated using HOT, rather than having only specific index types able to support HOT-updated tuples.

This is clearly all hand-wavy, but if the BRIN indexes could work in this way then it seems like we should be able to generalize what it is about BRIN that allows it and provide a way for other kinds of indexes to support tuples being HOT updated. It appears to be clearly useful in some use-cases. That's really what I was trying to get at.

Thanks!

Stephen
On 07/27/2016 12:07 AM, Tom Lane wrote: > >> 4. Now, update that small table 500 times per second. >> That's a recipe for runaway table bloat; VACUUM can't do much because >> there's always some minutes-old transaction hanging around (and SNAPSHOT >> TOO OLD doesn't really help, we're talking about minutes here), and >> because of all of the indexes HOT isn't effective. > Hm, I'm not following why this is a disaster. OK, you have circa 100% > turnover of the table in the lifespan of the slower transactions, but I'd > still expect vacuuming to be able to hold the bloat to some small integer > multiple of the minimum possible table size. (And if the table is small, > that's still small.) I suppose really long transactions (pg_dump?) could > be pretty disastrous, but there are ways around that, like doing pg_dump > on a slave. Is there any theoretical obstacle which would make it impossible to teach VACUUM not to hold back the whole vacuum horizon, but just to leave a single transaction alone in case of a long-running REPEATABLE READ transaction ? -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic Ltd
* Hannu Krosing (hkrosing@gmail.com) wrote: > On 07/27/2016 12:07 AM, Tom Lane wrote: > > > >> 4. Now, update that small table 500 times per second. > >> That's a recipe for runaway table bloat; VACUUM can't do much because > >> there's always some minutes-old transaction hanging around (and SNAPSHOT > >> TOO OLD doesn't really help, we're talking about minutes here), and > >> because of all of the indexes HOT isn't effective. > > Hm, I'm not following why this is a disaster. OK, you have circa 100% > > turnover of the table in the lifespan of the slower transactions, but I'd > > still expect vacuuming to be able to hold the bloat to some small integer > > multiple of the minimum possible table size. (And if the table is small, > > that's still small.) I suppose really long transactions (pg_dump?) could > > be pretty disastrous, but there are ways around that, like doing pg_dump > > on a slave. > Is there any theoretical obstacle which would make it impossible to > teach VACUUM not to hold back the whole vacuum horizon, but just > to leave a single transaction alone in case of a long-running > REPEATABLE READ transaction ? I've looked into this a couple of times and I believe it's possible to calculate what records have to remain available for the long-running transaction, but it's far from trivial. I do think that's a direction which we really need to go in, however. Having a single horizon which is dictated by the oldest running transaction isn't a tenable solution in environments with a lot of churn. Thanks! Stephen
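In the meantime, the single horizon at least makes the offender easy to find; a query along these lines (9.4 and later, where pg_stat_activity exposes backend_xmin) shows which sessions are pinning the oldest visible xid:

-- Long-running transactions/snapshots holding back the vacuum horizon.
SELECT pid, datname, usename, state,
       backend_xmin,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;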
Stephen Frost <sfrost@snowman.net> writes: > * Hannu Krosing (hkrosing@gmail.com) wrote: >> Is there any theoretical obstacle which would make it impossible to >> teach VACUUM not to hold back the whole vacuum horizon, but just >> to leave a single transaction alone in case of a long-running >> REPEATABLE READ transaction ? > I've looked into this a couple of times and I believe it's possible to > calculate what records have to remain available for the long-running > transaction, but it's far from trivial. I think it'd become a lot easier if we went over to representing snapshots as LSN positions (and, concomitantly, had an inexpensive way to translate XIDs to their commit LSNs). That would mean that (1) a backend's snapshot state could be fully exposed in PGPROC, at least up to some small number of active snapshots; (2) it'd be fairly cheap for VACUUM to detect that a dead tuple's XMIN and XMAX are either both before or both after each live snapshot. Someone (Heikki, I think) has been working on this but I've not seen any patch yet. regards, tom lane
<p><br /><br /><div class="moz-cite-prefix">On 7/28/16 4:39 AM, Geoff Winkless wrote:<br /></div><blockquote cite="mid:CAEzk6feRoM6wMqKVsHjfyL0+GrEUDicQ+fbTBmqTtdQsSYCBGw@mail.gmail.com"type="cite"><p dir="ltr">On 28 Jul 2016 12:19,"Vitaly Burovoy" <<a href="mailto:vitaly.burovoy@gmail.com" moz-do-not-send="true">vitaly.burovoy@gmail.com</a>>wrote:<br /> ><br /> > On 7/28/16, Geoff Winkless <<a href="mailto:pgsqladmin@geoff.dj"moz-do-not-send="true">pgsqladmin@geoff.dj</a>> wrote:<br /> > > On 27 July 2016at 17:04, Bruce Momjian <<a href="mailto:bruce@momjian.us" moz-do-not-send="true">bruce@momjian.us</a>> wrote:<br/> > ><br /> > >> Well, their big complaint about binary replication is that a bug can<br /> >>> spread from a master to all slaves, which doesn't happen with statement<br /> > >> level replication.<br/> > ><br /> > > <br /> > > I'm not sure that that makes sense to me. If there's a databasebug that<br /> > > occurs when you run a statement on the master, it seems there's a decent<br /> > >chance that that same bug is going to occur when you run the same statement<br /> > > on the slave.<br /> >><br /> > > Obviously it depends on the type of bug and how identical the slave is, but<br /> > > statement-levelreplication certainly doesn't preclude such a bug from<br /> > > propagating.<br /> > ><br />> > Geoff<br /> ><br /> > Please, read the article first! The bug is about wrong visibility of<br /> >tuples after applying WAL at slaves.<br /> > For example, you can see two different records selecting from a table<br/> > by a primary key (moreover, their PKs are the same, but other columns<br /> > differ).<p dir="ltr">I readthe article. It affected slaves as well as the master.<p dir="ltr">I quote:<br /> "because of the way replication works,this issue has the potential to spread into all of the databases in a replication hierarchy"<p dir="ltr">I maintainthat this is a nonsense argument. Especially since (as you pointed out and as I missed first time around) the bugactually occurred at different records on different slaves, so he invalidates his own point.<p dir="ltr">Geoff</blockquote>Seriously?<br /><br /> There's a valid point here, you're sending over commands at the blocklevel, effectively "write to disk at this location" versus "update this record based on PK", obviously this has somedrawbacks that are reason for concern. Does it validate the move on its own? NO. Does it add to the reasons to moveaway? Yes, that much is obvious.<br /><br /> Please read this thread:<br /><a class="moz-txt-link-freetext" href="https://www.reddit.com/r/programming/comments/4vms8x/why_we_lost_uber_as_a_user_postgresql_mailing_list/d5zx82n">https://www.reddit.com/r/programming/comments/4vms8x/why_we_lost_uber_as_a_user_postgresql_mailing_list/d5zx82n</a><br /><br/> Do I love postgresql? Yes. <br /> Have I been bitten by things such as this? Yes.<br /> Should the community learnfrom these things and think of ways to avoid it? Absolutely!<br /><br /> -Alfred
On 7/28/16 7:08 AM, Merlin Moncure wrote:
> *) postgres may not be the ideal choice for those who want a thin and
> simple database

This is a huge market; addressing it will bring mindshare and more jobs, code, and braintrust to Postgres.

-Alfred
On 7/26/16 9:54 AM, Joshua D. Drake wrote:
> Hello,
>
> The following article is a very good look at some of our limitations
> and highlights some of the pains many of us have been working "around"
> since we started using the software.
>
> https://eng.uber.com/mysql-migration/
>
> Specifically:
>
> * Inefficient architecture for writes
> * Inefficient data replication
> * Issues with table corruption
> * Poor replica MVCC support
> * Difficulty upgrading to newer releases
>
> It is a very good read and I encourage our hackers to do so with an
> open mind.
>
> Sincerely,
>
> JD

It was a good read. Having based a high performance web tracking service as well as a high performance security appliance on Postgresql, I too have been bitten by these issues. I had a few questions that maybe the folks with core knowledge can answer:

1) Would it be possible to create a "star-like" schema to fix this problem? For example, let's say you have a table that is similar to Uber's:

  col0pk, col1, col2, col3, col4, col5

All cols are indexed, and assume that updates happen to only 1 column at a time. Why not figure out some way to encourage or automate the splitting of this table into multiple tables that present themselves as a single table? What I mean is that you would then wind up with the following tables:

  table1: col0pk, col1
  table2: col0pk, col2
  table3: col0pk, col3
  table4: col0pk, col4
  table5: col0pk, col5

Now when you update "col5" on a row, you only have to update the index on table5:col5 and table5:col0pk, as opposed to beforehand where you would have to update more indexes. In addition I believe that vacuum overhead would be somewhat mitigated as well in this case.

2) Why not have a look at how innodb does its storage? Would it be possible to do this?

3) For the small-ish table that Uber mentioned, is there a way to "have it in memory" while still providing some level of sync to disk so that it is consistent?

thanks!
-Alfred
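Question 1 can be prototyped today in plain SQL by splitting the wide table and papering over the split with a view (names are hypothetical; updatable-view rules or triggers are omitted for brevity):

CREATE TABLE t_col1 (col0pk int PRIMARY KEY, col1 int);
CREATE TABLE t_col2 (col0pk int PRIMARY KEY, col2 int);
CREATE TABLE t_col3 (col0pk int PRIMARY KEY, col3 int);
CREATE INDEX ON t_col1 (col1);
CREATE INDEX ON t_col2 (col2);
CREATE INDEX ON t_col3 (col3);

-- Present the pieces as one logical table.
CREATE VIEW t AS
SELECT c1.col0pk, c1.col1, c2.col2, c3.col3
FROM t_col1 c1
JOIN t_col2 c2 USING (col0pk)
JOIN t_col3 c3 USING (col0pk);

-- Updating a single attribute now touches only that table's indexes:
UPDATE t_col3 SET col3 = col3 + 1 WHERE col0pk = 42;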
On 2 August 2016 at 08:11, Alfred Perlstein <alfred@freebsd.org> wrote: > On 7/2/16 4:39 AM, Geoff Winkless wrote: > > I maintain that this is a nonsense argument. Especially since (as you pointed out and as I missed first time around)the bug actually occurred at different records on different slaves, so he invalidates his own point. > Seriously? No, I make a habit of spouting off random arguments to a list full of people whose opinions I massively respect purely for kicks. What do you think? > There's a valid point here, you're sending over commands at the block level, effectively "write to disk at this location"versus "update this record based on PK", obviously this has some drawbacks that are reason for concern. Writing values directly into file offsets is only problematic if something else has failed that has caused the file to be an inexact copy. If a different bug occurred that caused the primary key to be corrupted on the slave (or indeed the master), PK-based updates would exhibit similar propagation errors. To reiterate my point, uber's described problem came about because of a bug. Every software has bugs at some point in its life, to pretend otherwise is simply naive. I'm not trying to excuse the bug, or to belittle the impact that such a bug has on data integrity or on uber or indeed on the reputation of PostgreSQL. While I'm prepared to accept (because I have a job that requires I spend time on things other than digging through obscure reddits and mailing lists to understand more fully the exact cause) that in _this particular instance_ the bug was propagated because of the replication mechanism (although I'm still dubious about that, as per my comment above), that does _not_ preclude other bugs propagating in a statement-based replication. That's what I said is a nonsense argument, and no-one has yet explained in what way that's incorrect. Geoff
On Sat, Jul 30, 2016 at 12:06 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Bruce Momjian (bruce@momjian.us) wrote: >> On Fri, Jul 29, 2016 at 10:44:29AM -0400, Stephen Frost wrote: >> > Another thought that was kicking around in my head related to that is if >> > we could have indexes that only provide page-level information (similar >> > to BRIN, but maybe a btree) and which also would allow HOT updates. >> > Those indexes would typically be used in a bitmap index scan where we're >> > going to be doing a bitmap heap scan with a recheck, of course, though I >> > wonder if we could come up with a way to do an in-order bitmap index >> > scan where we sort the tuples on the page and then perform some kind of >> > mergejoin recheck (or just pull out whatever the lowest-not-seen each >> > time we sort the tuples on the page). >> >> So allow HOT updates if the updated row is on the same page, even if the >> indexed column was changed, by scanning the page --- got it. > > The idea I had was to allow creation of indexes which *only* include the > page ID. Your rephrase seems to imply that we'd have a regular index > but then be able to figure out if a given tuple had any HOT updates > performed on it and, if so, scan the entire page. As I understand it, > it's more complicated than that because we must involve an index when > updating a tuple in some cases (UNIQUE?) and therefore we don't perform > HOT in the case where any indexed column is being changed. > Why we need to add a record in all indexes if only the key corresponding to one of indexes is updated? Basically, if the tuple can fit on same page, why can't we consider it as HOT (or HPT - heap partial tuple or something like that), unless it updates all the keys for all the indexes. Now, we can't consider such tuple versions for pruning as we do for HOT. The downside of this could be that we might need to retain some of the line pointers for more time (as we won't be able to reuse the line pointer till it is used in any one of the indexes and those could be reused once we make next non-HOT update). However, this should allow us not to update the indexes for which the corresponding column in tuple is not updated. I think it is a basic premise that if any index column is updated then the update will be considered as non-HOT, so there is a good chance that I might be missing something here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 2, 2016 at 5:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Why we need to add a record in all indexes if only the key
> corresponding to one of indexes is updated? Basically, if the tuple
> can fit on same page, why can't we consider it as HOT (or HPT - heap
> partial tuple or something like that), unless it updates all the keys
> for all the indexes. Now, we can't consider such tuple versions for
> pruning as we do for HOT. The downside of this could be that we might
> need to retain some of the line pointers for more time (as we won't be
> able to reuse the line pointer till it is used in any one of the
> indexes and those could be reused once we make next non-HOT update).
> However, this should allow us not to update the indexes for which the
> corresponding column in tuple is not updated. I think it is a basic
> premise that if any index column is updated then the update will be
> considered as non-HOT, so there is a good chance that I might be
> missing something here.

Well, I think that the biggest advantage of a HOT update is the fact that it enables HOT pruning. In other words, we're not primarily trying to minimize index traffic; we're trying to make cleanup of the heap cheaper. So this could certainly be done, but I'm not sure it would buy us enough to be worth the engineering effort involved.

Personally, I think that incremental surgery on our current heap format to try to fix this is not going to get very far. If you look at the history of this, 8.3 was a huge release for timely cleanup of dead tuples. There was also significant progress in 8.4 as a result of 5da9da71c44f27ba48fdad08ef263bf70e43e689. As far as I can recall, we then made no progress at all in 9.0 - 9.4. We made a very small improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but that's pretty niche. In 9.6, we have "snapshot too old", which I'd argue is potentially a large improvement, but it was big and invasive and will no doubt pose code maintenance hazards in the years to come; also, many people won't be able to use it or won't realize that they should use it. I think it is likely that further incremental improvements here will be quite hard to find, and the amount of effort will be large relative to the amount of benefit. I think we need a new storage format where the bloat is cleanly separated from the data rather than intermingled with it; every other major RDBMS works that way. Perhaps this is a case of "the grass is greener on the other side of the fence", but I don't think so.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2 August 2016 at 15:27, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Aug 2, 2016 at 5:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Why we need to add a record in all indexes if only the key >> corresponding to one of indexes is updated? Basically, if the tuple >> can fit on same page, why can't we consider it as HOT (or HPT - heap >> partial tuple or something like that), unless it updates all the keys >> for all the indexes. Now, we can't consider such tuple versions for >> pruning as we do for HOT. The downside of this could be that we might >> need to retain some of the line pointers for more time (as we won't be >> able to reuse the line pointer till it is used in any one of the >> indexes and those could be reused once we make next non-HOT update). >> However, this should allow us not to update the indexes for which the >> corresponding column in tuple is not updated. I think it is a basic >> premise that if any index column is updated then the update will be >> considered as non-HOT, so there is a good chance that I might be >> missing something here. > > Well, I think that the biggest advantage of a HOT update is the fact > that it enables HOT pruning. In other words, we're not primarily > trying to minimize index traffic; we're trying to make cleanup of the > heap cheaper. So this could certainly be done, but I'm not sure it > would buy us enough to be worth the engineering effort involved. (Hi, just back from leave and catching up on emails.) The above suggested design is something I've been working on for last few days. In my design I referred to "intermediate root" tuples. I've got a detailed design for it and it works, yay!... but Pavan has managed to shoot it down with some accurate observations about it leading to an annoying accumulation of root pointers and complex logic to remove them. So I'm not pursuing it further at this stage. I'm writing up my conclusions around what we should do now, so should post later today. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On Aug 2, 2016, at 2:33 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote: > >> On 2 August 2016 at 08:11, Alfred Perlstein <alfred@freebsd.org> wrote: >>> On 7/2/16 4:39 AM, Geoff Winkless wrote: >>> I maintain that this is a nonsense argument. Especially since (as you pointed out and as I missed first time around)the bug actually occurred at different records on different slaves, so he invalidates his own point. > >> Seriously? > > No, I make a habit of spouting off random arguments to a list full of > people whose opinions I massively respect purely for kicks. What do > you think? > >> There's a valid point here, you're sending over commands at the block level, effectively "write to disk at this location"versus "update this record based on PK", obviously this has some drawbacks that are reason for concern. > > Writing values directly into file offsets is only problematic if > something else has failed that has caused the file to be an inexact > copy. If a different bug occurred that caused the primary key to be > corrupted on the slave (or indeed the master), PK-based updates would > exhibit similar propagation errors. > > To reiterate my point, uber's described problem came about because of > a bug. Every software has bugs at some point in its life, to pretend > otherwise is simply naive. I'm not trying to excuse the bug, or to > belittle the impact that such a bug has on data integrity or on uber > or indeed on the reputation of PostgreSQL. While I'm prepared to > accept (because I have a job that requires I spend time on things > other than digging through obscure reddits and mailing lists to > understand more fully the exact cause) that in _this particular > instance_ the bug was propagated because of the replication mechanism > (although I'm still dubious about that, as per my comment above), that > does _not_ preclude other bugs propagating in a statement-based > replication. That's what I said is a nonsense argument, and no-one has > yet explained in what way that's incorrect. > > Geoff Geoff, You are quite technical, my feeling is that you will understand it, however it will need to be a self learned lesson. -Alfred
On Tue, Aug 2, 2016 at 3:07 PM, Alfred Perlstein <alfred@freebsd.org> wrote: > You are quite technical, my feeling is that you will understand it, however it will need to be a self learned lesson. I don't know what this is supposed to mean, but I think that Geoff's point is somewhat valid. No matter how you replicate data, there is always the possibility that you will replicate any corruption along with the data - or that your copy will be unfaithful to the original. The possible advantage of logical replication rather than physical replication is that any errors you replicate will be logical errors rather than physical errors - so if the heap gets out of step with the indexes on the master, the same problem will not necessarily occur on the slave. On the flip side, despite what Uber found in their environment, physical replication tends to be high-performance because the replay is dead simple. Andres and others have done a good job making our logical decoding facility fast, but I believe it's still slower than plain old physical replication and probably always will be, and the trigger-based logical replication solutions are slower still. Consequently, I believe that both physical and logical replication have advantages, and that's why we should support both of them. Then, each individual user can make the trade-offs they prefer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > On Tue, Aug 2, 2016 at 3:07 PM, Alfred Perlstein <alfred@freebsd.org> wrote: > > You are quite technical, my feeling is that you will understand it, however it will need to be a self learned lesson. > > I don't know what this is supposed to mean, but I think that Geoff's > point is somewhat valid. No matter how you replicate data, there is > always the possibility that you will replicate any corruption along > with the data - or that your copy will be unfaithful to the original. I believe what Geoff was specifically getting at is probably best demonstrated with an example. Consider a bug in the btree index code which will accept a value but not store it correctly. INSERT INTO mytable (indexed_column) VALUES (-1000000000); /* oops, bug, this value gets stored in the wrong place in the btree */ We happily accept the record and insert it into the btree index, but that insert is incorrect and results in the btree being corrupted because some bug doesn't handle such large values correctly. In such a case, either approach to replication (replicating the query statement, or replicating the changes to the btree page exactly) would result in corruption on the replica. The above represents a bug in *just* the btree side of things (the physical replication did its job correctly, even though the result is a corrupted index on the replica). With physical replication, there is the concern that a bug in *just* the physical (WAL) side of things could cause corruption. That is, we correctly accept and store the value on the primary, but the records generated to send that data to the replica are incorrect and result in an invalid state on the replica. Of course, a bug in the physical side of things which caused corruption would mean that *crash recovery* would also cause corruption. As I understand it, that same concern exists for MySQL, so, moving to logical replication doesn't actually mean you don't need to worry about bugs in the crash recovery side of things, assuming you depend on the database to come back up in a consistent manner after a crash. Thanks! Stephen
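A rough way to notice that kind of heap/index disagreement, reusing the hypothetical mytable/indexed_column from the example above (amcheck was not yet in contrib at the time of this thread, so this simply compares a forced sequential scan with a forced index scan):

BEGIN;
-- Count the rows without using the index...
SET LOCAL enable_indexscan = off;
SET LOCAL enable_bitmapscan = off;
SELECT count(*) FROM mytable WHERE indexed_column = -1000000000;
-- ...then count them again via the index; if the two counts differ,
-- the heap and the index no longer agree.
SET LOCAL enable_indexscan = on;
SET LOCAL enable_bitmapscan = on;
SET LOCAL enable_seqscan = off;
SELECT count(*) FROM mytable WHERE indexed_column = -1000000000;
COMMIT;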
Stephen Frost <sfrost@snowman.net> writes: > With physical replication, there is the concern that a bug in *just* the > physical (WAL) side of things could cause corruption. Right. But with logical replication, there's the same risk that the master's state could be fine but a replication bug creates corruption on the slave. Assuming that the logical replication works by issuing valid SQL commands to the slave, one could hope that this sort of "corruption" only extends to having valid data on the slave that fails to match the master. But that's still not a good state to be in. And to the extent that performance concerns lead the implementation to bypass some levels of the SQL engine, you can easily lose that guarantee too. In short, I think Uber's position that logical replication is somehow more reliable than physical is just wishful thinking. If anything, my money would be on the other way around: there's a lot less mechanism that can go wrong in physical replication. Which is not to say there aren't good reasons to use logical replication; I just do not believe that one. regards, tom lane
On 8/2/16 2:14 PM, Tom Lane wrote: > Stephen Frost <sfrost@snowman.net> writes: >> With physical replication, there is the concern that a bug in *just* the >> physical (WAL) side of things could cause corruption. > Right. But with logical replication, there's the same risk that the > master's state could be fine but a replication bug creates corruption on > the slave. > > Assuming that the logical replication works by issuing valid SQL commands > to the slave, one could hope that this sort of "corruption" only extends > to having valid data on the slave that fails to match the master. > But that's still not a good state to be in. And to the extent that > performance concerns lead the implementation to bypass some levels of the > SQL engine, you can easily lose that guarantee too. > > In short, I think Uber's position that logical replication is somehow more > reliable than physical is just wishful thinking. If anything, my money > would be on the other way around: there's a lot less mechanism that can go > wrong in physical replication. Which is not to say there aren't good > reasons to use logical replication; I just do not believe that one. > > regards, tom lane > > The reason it can be less catastrophic is that for logical replication you may futz up your data, but you are safe from corrupting your entire db. Meaning if an update is missed or doubled that may be addressed by a fixup SQL stmt, however if the replication causes a write to the entirely wrong place in the db file then you need to "fsck" your db and hope that nothing super critical was blown away. The impact across a cluster is potentially magnified by physical replication. So for instance, let's say there is a bug in the master's write to disk. The logical replication acts as a barrier from that bad write going to the slaves. With bad writes going to slaves then any corruption experienced on the master will quickly reach the slaves and they too will be corrupted. With logical replication a bug may be stopped at the replication layer. At that point you can resync the slave from the master. Now in the case of physical replication all your base are belong to zuul and you are in a very bad state. That said with logical replication, who's to say that if the statement is replicated to a slave that the slave won't experience the same bug and also corrupt itself. We may be saying the same thing, but still there is something to be said for logical replication... also, didnt they show that logical replication was faster for some use cases at Uber? -Alfred
On Tue, Aug 2, 2016 at 07:30:22PM -0700, Alfred Perlstein wrote: > So for instance, let's say there is a bug in the master's write to disk. > The logical replication acts as a barrier from that bad write going to the > slaves. With bad writes going to slaves then any corruption experienced on > the master will quickly reach the slaves and they too will be corrupted. > > With logical replication a bug may be stopped at the replication layer. At > that point you can resync the slave from the master. > > Now in the case of physical replication all your base are belong to zuul and > you are in a very bad state. > > That said with logical replication, who's to say that if the statement is > replicated to a slave that the slave won't experience the same bug and also > corrupt itself. > > We may be saying the same thing, but still there is something to be said for > logical replication... also, didnt they show that logical replication was > faster for some use cases at Uber? I saw from the Uber article that they weren't going to per-row logical replication but _statement_ replication, which is very hard to do because typical SQL doesn't record what concurrent transactions committed before a new statement's transaction snapshot is taken, and doesn't record lock order for row updates blocked by concurrent activity --- both of which affect the final result from the query. So, for statement replication, it is not a question of whether the code has bugs, but that the replay is not 100% possible in all cases, unless you switch to some statement-row-lock hybrid ability. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
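Two concrete illustrations of why replaying the raw SQL text is not enough (table and column names are invented):

-- Volatile functions: a replica replaying this text computes its own values.
INSERT INTO audit_log (id, noted_at, token) VALUES (1, now(), random());

-- Snapshot dependence: the value inserted depends on which concurrent
-- transactions had committed when the statement's snapshot was taken, so a
-- replica replaying the same text later can easily compute a different answer.
INSERT INTO invoices (invoice_no, created_at)
SELECT coalesce(max(invoice_no), 0) + 1, now() FROM invoices;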
On 03/08/16 02:27, Robert Haas wrote: > > Personally, I think that incremental surgery on our current heap > format to try to fix this is not going to get very far. If you look > at the history of this, 8.3 was a huge release for timely cleanup of > dead tuple. There was also significant progress in 8.4 as a result of > 5da9da71c44f27ba48fdad08ef263bf70e43e689. As far as I can recall, we > then made no progress at all in 9.0 - 9.4. We made a very small > improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but > that's pretty niche. In 9.6, we have "snapshot too old", which I'd > argue is potentially a large improvement, but it was big and invasive > and will no doubt pose code maintenance hazards in the years to come; > also, many people won't be able to use it or won't realize that they > should use it. I think it is likely that further incremental > improvements here will be quite hard to find, and the amount of effort > will be large relative to the amount of benefit. I think we need a > new storage format where the bloat is cleanly separated from the data > rather than intermingled with it; every other major RDMS works that > way. Perhaps this is a case of "the grass is greener on the other > side of the fence", but I don't think so. > Yeah, I think this is a good summary of the state of play. The only other new db development to use a non-overwriting design like ours that I know of was Jim Starkey's Falcon engine for (ironically) MySQL 6.0. Not sure if anyone is still progressing that at all now. I do wonder if Uber could have successfully tamed dead tuple bloat with aggressive per-table autovacuum settings (and if in fact they tried), but as I think Robert said earlier, it is pretty easy to come up with a high-update (or insert + delete) workload that makes for a pretty ugly bloat component even with really aggressive autovacuuming. Cheers Mark
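For reference, the sort of aggressive per-table settings being alluded to look roughly like this; the table name and thresholds are illustrative only:

-- Vacuum the hot table after ~1000 dead tuples rather than after 20% of the
-- table has churned, and don't throttle the work while doing it.
ALTER TABLE small_hot_table SET (
    autovacuum_vacuum_scale_factor = 0.0,
    autovacuum_vacuum_threshold    = 1000,
    autovacuum_vacuum_cost_delay   = 0,
    autovacuum_vacuum_cost_limit   = 10000
);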
On 29 July 2016 at 15:44, Stephen Frost <sfrost@snowman.net> wrote: > All very hand-wavy, of course, and it'd make sense to make the concept > work for BRIN before we consider anything else, but it seems like there > could be a use-case for allowing indexes other than BRIN to be built in > a way that allows HOT updates to happen, thus eliminating the cost of > having to update those indexes when the tuple is changed, in many cases. > Of course, those indexes couldn't be used UNIQUE indexes or used for > primary keys, and adjusting the parameters to a BRIN index you could > possibly get a similar index, but this might allow such an index to > still be usable for index-only scans, which a BRIN index will never be > able to provide. This idea is vaguely similar to the concepts I've been working on, based on earlier work in 2007. I'm starting a new post for a full discussion. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 3, 2016 at 3:30 AM, Alfred Perlstein <alfred@freebsd.org> wrote: > We may be saying the same thing, but still there is something to be said for > logical replication... also, didnt they show that logical replication was > faster for some use cases at Uber? There is certainly something to be said for logical replication just as there is something to be said for having regular pg_dumps which are logical exports of your database. But neither is a substitute for having real backups or a real standby database. They serve different purposes and solve different problems. But when you have a hardware failure or physical disaster the last thing you want to be doing is failing over to a different database that may or may not have the same data or same behaviour as your former primary. You want to switch over to a standby that is as near as possibly byte for byte identical and will behave exactly the same. If there was a bug in your primary the last time you want to find out about it and have to be dealing with fixing it is when you have a disaster in your primary and need to be back up asap. Honestly the take-away I see in the Uber story is that they apparently had nobody on staff that was on -hackers or apparently even -general and tried to go it alone rather than involve experts from outside their company. As a result they misdiagnosed their problems based on prejudices seeing what they expected to see rather than what the real problem was. -- greg
On Tue, Aug 2, 2016 at 10:33:15PM -0400, Bruce Momjian wrote: > I saw from the Uber article that they weren't going to per-row logical > replication but _statement_ replication, which is very hard to do > because typical SQL doesn't record what concurrent transactions > committed before a new statement's transaction snapshot is taken, and > doesn't record lock order for row updates blocked by concurrent activity > --- both of which affect the final result from the query. > > So, for statement replication, it is not a question of whether the code > has bugs, but that the replay is not 100% possible in all cases, unless > you switch to some statement-row-lock hybrid ability. Oh, and one more problem with statement-level replication is that the overhead of statement replay is high, as high as it was on the master. That leaves minimal server resources left to handle read-only workloads on the slave. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 3 August 2016 at 05:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In short, I think Uber's position that logical replication is somehow more
> reliable than physical is just wishful thinking. If anything, my money
> would be on the other way around: there's a lot less mechanism that can go
> wrong in physical replication.
Particularly since they aren't using row-based logical replication, but - it seems - statement based replication. We all know the problems there.
> On Aug 3, 2016, at 3:29 AM, Greg Stark <stark@mit.edu> wrote: > >> > > Honestly the take-away I see in the Uber story is that they apparently > had nobody on staff that was on -hackers or apparently even -general > and tried to go it alone rather than involve experts from outside > their company. As a result they misdiagnosed their problems based on > prejudices seeing what they expected to see rather than what the real > problem was. > +1 very true. At the same time there are some lessons to be learned. At the very least putting in big bold letters where to come for help is one.
On Wed, Aug 3, 2016 at 8:58 AM, Alfred Perlstein <alfred@freebsd.org> wrote: > On Aug 3, 2016, at 3:29 AM, Greg Stark <stark@mit.edu> wrote: > >> Honestly the take-away I see in the Uber story is that they apparently >> had nobody on staff that was on -hackers or apparently even -general >> and tried to go it alone rather than involve experts from outside >> their company. As a result they misdiagnosed their problems based on >> prejudices seeing what they expected to see rather than what the real >> problem was. > > +1 very true. > > At the same time there are some lessons to be learned. At the > very least putting in big bold letters where to come for help is > one. +1 My initial experience with PostgreSQL would have been entirely different had I not found the community lists and benefited from the assistance and collective wisdom found on them. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3 August 2016 at 15:04, Kevin Grittner <kgrittn@gmail.com> wrote: > My initial experience with PostgreSQL would have been entirely > different had I not found the community lists and benefited from > the assistance and collective wisdom found on them. The top non-sponsored link on google for "postgres support" takes you straight to a page with a link to the mailing lists. I'm not sure that not being able to find them was a problem. I can well imagine that uber wouldn't have wanted to publicise their problems (and so wouldn't have used a mailing list anyway); obviously I've no way of knowing if they contacted any of the support companies in the professional services page - I assume that professional courtesy (and/or NDAs!) would preclude anyone from posting such here anyway. The problem with the professional services page is that the list of companies is very dry, but it might be difficult to improve: as a community it might be considered unreasonable to promote one over the other; however if I had to go searching for professional support (and hadn't seen the level of interaction that some of those companies' employees provide on the mailing lists) I would have no clear idea where to start. Perhaps listing those companies that provide employment for some of the core developers at the top (and explaining so) might be acceptable? (or maybe not just core? you get the idea though). Maybe a separate section for support companies versus hosts? Geoff
On Tue, Aug 2, 2016 at 5:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Stephen Frost <sfrost@snowman.net> writes: >> With physical replication, there is the concern that a bug in *just* the >> physical (WAL) side of things could cause corruption. > > Right. But with logical replication, there's the same risk that the > master's state could be fine but a replication bug creates corruption on > the slave. > > Assuming that the logical replication works by issuing valid SQL commands > to the slave, one could hope that this sort of "corruption" only extends > to having valid data on the slave that fails to match the master. > But that's still not a good state to be in. And to the extent that > performance concerns lead the implementation to bypass some levels of the > SQL engine, you can easily lose that guarantee too. > > In short, I think Uber's position that logical replication is somehow more > reliable than physical is just wishful thinking. If anything, my money > would be on the other way around: there's a lot less mechanism that can go > wrong in physical replication. Which is not to say there aren't good > reasons to use logical replication; I just do not believe that one. I don't think they are saying that logical replication is more reliable than physical replication, nor do I believe that to be true. I think they are saying that if logical corruption happens, you can fix it by typing SQL statements to UPDATE, INSERT, or DELETE the affected rows, whereas if physical corruption happens, there's no equally clear path to recovery. If an index is damaged, you can recreate it; if a heap page is damaged such that you can no longer scan the table, you're going to need expert assistance. And I think there's some point to that. I agree with the general sentiment that they could have gotten further and been more successful with PostgreSQL if they had some expert advice, but I think it's indisputable that recovering a physically corrupted database is generally a lot more painful than one where you only have to fix up some damaged data. Whether we really have data-corrupting WAL-replay bugs sufficiently frequently to make this an ongoing issue rather than a one-time event is also debatable, but nonetheless I don't think their point is completely invalid. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I don't think they are saying that logical replication is more > reliable than physical replication, nor do I believe that to be true. > I think they are saying that if logical corruption happens, you can > fix it by typing SQL statements to UPDATE, INSERT, or DELETE the > affected rows, whereas if physical corruption happens, there's no > equally clear path to recovery. Well, that's not an entirely unreasonable point, but I dispute the implication that it makes recovery from corruption an easy thing to do. How are you going to know what SQL statements to issue? If the master database is changing 24x7, how are you going to keep up with that? I think the realistic answer if you suffer replication-induced corruption is usually going to be "re-clone that slave", and logical rep doesn't really offer much gain in that. regards, tom lane
On 08/03/2016 11:23 AM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I don't think they are saying that logical replication is more >> reliable than physical replication, nor do I believe that to be true. >> I think they are saying that if logical corruption happens, you can >> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the >> affected rows, whereas if physical corruption happens, there's no >> equally clear path to recovery. > > Well, that's not an entirely unreasonable point, but I dispute the > implication that it makes recovery from corruption an easy thing to do. > How are you going to know what SQL statements to issue? If the master > database is changing 24x7, how are you going to keep up with that? > > I think the realistic answer if you suffer replication-induced corruption > is usually going to be "re-clone that slave", and logical rep doesn't > really offer much gain in that. Yes, it actually does. The ability to unsubscribe a set of tables, truncate them and then resubscribe them is vastly superior to having to take a base backup. JD > > regards, tom lane > > -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them. Unless otherwise stated, opinions are my own.
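The exact commands depend on the replication tool, but the shape of the operation can be sketched with nothing more than core SQL plus postgres_fdw; server, schema and table names below are hypothetical, and a real tool would also handle catching up on changes made during the copy:

-- On the replica, after unsubscribing (or pausing) replication for the table:
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER master_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'master.example.com', dbname 'appdb');
CREATE USER MAPPING FOR CURRENT_USER SERVER master_srv
    OPTIONS (user 'repair', password 'secret');
CREATE SCHEMA repair;
IMPORT FOREIGN SCHEMA public LIMIT TO (orders) FROM SERVER master_srv INTO repair;

BEGIN;
TRUNCATE orders;                          -- only the damaged table is touched
INSERT INTO orders SELECT * FROM repair.orders;
COMMIT;
-- ...then resubscribe the table so ongoing changes flow again.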
On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I don't think they are saying that logical replication is more >> reliable than physical replication, nor do I believe that to be true. >> I think they are saying that if logical corruption happens, you can >> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the >> affected rows, whereas if physical corruption happens, there's no >> equally clear path to recovery. > > Well, that's not an entirely unreasonable point, but I dispute the > implication that it makes recovery from corruption an easy thing to do. > How are you going to know what SQL statements to issue? If the master > database is changing 24x7, how are you going to keep up with that? I think in many cases people fix their data using business logic. For example, suppose your database goes down and you have to run pg_resetxlog to get it back up. You dump-and-restore, as one does, and find that you can't rebuild one of your unique indexes because there are now two records with that same PK. Well, what you do is you look at them and judge which one has the correct data, often the one that looks more complete or the one with the newer timestamp. Or, maybe you need to merge them somehow. In my experience helping users through problems of this type, once you explain the problem to the user and tell them they have to square it on their end, the support call ends. The user may not always be entirely thrilled about having to, say, validate a problematic record against external sources of truth, but they usually know how to do it. Database bugs aren't the only way that databases become inaccurate. If the database that they use to keep track of land ownership in the jurisdiction where I live says that two different people own the same piece of property, somewhere there is a paper deed in a filing cabinet. Fishing that out to understand what happened may not be fun, but a DBA can explain that problem to other people in the organization and those people can get it fixed. It's a problem, but it's fixable. On the other hand, if a heap tuple contains invalid infomask bits that cause an error every time you read the page (this actually happened to an EnterpriseDB customer!), the DBA can't tell other people how to fix it and can't fix it personally either. Instead, the DBA calls me. While I try to figure out what happened and solve the problem, every sequential scan on that table fails, so the customer is basically down. In contrast, in the logical corruption scenario, one record might be wrong, but basically everything is still working. So it's a difference between a problem that the DBA can work with coworkers to fix while the system is up, and a problem that the DBA can't fix and the system is meanwhile down. That's a big difference. > I think the realistic answer if you suffer replication-induced corruption > is usually going to be "re-clone that slave", and logical rep doesn't > really offer much gain in that. If you're using multi-master replication, the notion of what's a slave gets a bit fuzzy, but, apart from that, yes, this is often the solution. However, even here, logical replication can be better. Given the right tools, I can fix up the slave incrementally, comparing it to the master row by row and updating anything that's wrong. If I have to rebuild a physical master, I'm offline. 
The difference doesn't matter if the slave is so badly corrupted that it's unusable, but it's very common for corruption to involve only a handful of records, and many users not unreasonably prefer a database with a couple of corrupted records to one which is totally down. "Hey, the payroll record for that Tom Lane guy is messed up, don't cut his paycheck until we get that straightened out." "OK, no problem." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
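For what it's worth, the duplicate-key cleanup described above usually boils down to something like this (table and column names invented; rows that tie, or that disagree in ways a timestamp can't settle, still need the human judgment described above):

-- Find the duplicated keys...
SELECT pk, count(*) FROM payroll GROUP BY pk HAVING count(*) > 1;

-- ...then keep the most recently updated copy of each key and drop the rest.
DELETE FROM payroll p
 USING payroll newer
 WHERE newer.pk = p.pk
   AND newer.updated_at > p.updated_at;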
"Joshua D. Drake" <jd@commandprompt.com> writes: > On 08/03/2016 11:23 AM, Tom Lane wrote: >> I think the realistic answer if you suffer replication-induced corruption >> is usually going to be "re-clone that slave", and logical rep doesn't >> really offer much gain in that. > Yes, it actually does. The ability to unsubscribe a set of tables, > truncate them and then resubscribe them is vastly superior to having to > take a base backup. True, *if* you can circumscribe the corruption to a relatively small part of your database, logical rep might provide more support for a partial re-clone. regards, tom lane
On Wed, Aug 3, 2016 at 2:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> On 08/03/2016 11:23 AM, Tom Lane wrote: >>> I think the realistic answer if you suffer replication-induced corruption >>> is usually going to be "re-clone that slave", and logical rep doesn't >>> really offer much gain in that. > >> Yes, it actually does. The ability to unsubscribe a set of tables, >> truncate them and then resubscribe them is vastly superior to having to >> take a base backup. > > True, *if* you can circumscribe the corruption to a relatively small > part of your database, logical rep might provide more support for a > partial re-clone. When I worked with Wisconsin Courts to migrate their databases to PostgreSQL, we had a DBMS-agnostic logical replication system, and we had a compare program that could be run off-hours as well as having that be a background activity for the replication software to work on during idle time. Either way, a range of rows based on primary key was read on each side and hashed, the hashes compared, and if they didn't match there was a column-by-column compare for each row in the range, with differences listed. This is how we discovered issues like the non-standard handling of backslash mangling our data. Personally, I can't imagine running logical replication of supposedly matching sets of data without something equivalent. Certainly, the courts had source documents to use for resolving any question of the correct value on a mismatch, and I would imagine that many environments would. If you have a meaningful primary key (like a court case number, by which the file folder is physically located), seeing the different values for a specific column in a specific row makes fixes pretty straightforward. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
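That kind of check is easy to improvise in plain SQL: run the same query against both sides for the same key range and compare the two hashes, only drilling down column by column when they differ (table, key and range are hypothetical; the row-to-text output depends on settings such as DateStyle, so use identical settings on both sides):

SELECT md5(string_agg(c::text, E'\n' ORDER BY c.case_no))
FROM court_cases c
WHERE c.case_no BETWEEN '2016CV000001' AND '2016CV001000';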
On 03.08.2016 21:05, Robert Haas wrote: > On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> I don't think they are saying that logical replication is more >>> reliable than physical replication, nor do I believe that to be true. >>> I think they are saying that if logical corruption happens, you can >>> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the >>> affected rows, whereas if physical corruption happens, there's no >>> equally clear path to recovery. >> >> Well, that's not an entirely unreasonable point, but I dispute the >> implication that it makes recovery from corruption an easy thing to do. >> How are you going to know what SQL statements to issue? If the master >> database is changing 24x7, how are you going to keep up with that? > > I think in many cases people fix their data using business logic. For > example, suppose your database goes down and you have to run > pg_resetxlog to get it back up. You dump-and-restore, as one does, > and find that you can't rebuild one of your unique indexes because > there are now two records with that same PK. Well, what you do is you > look at them and judge which one has the correct data, often the one > that looks more complete or the one with the newer timestamp. Or, > maybe you need to merge them somehow. In my experience helping users > through problems of this type, once you explain the problem to the > user and tell them they have to square it on their end, the support > call ends. The user may not always be entirely thrilled about having > to, say, validate a problematic record against external sources of > truth, but they usually know how to do it. Database bugs aren't the > only way that databases become inaccurate. If the database that they > use to keep track of land ownership in the jurisdiction where I live > says that two different people own the same piece of property, > somewhere there is a paper deed in a filing cabinet. Fishing that out > to understand what happened may not be fun, but a DBA can explain that > problem to other people in the organization and those people can get > it fixed. It's a problem, but it's fixable. > > On the other hand, if a heap tuple contains invalid infomask bits that > cause an error every time you read the page (this actually happened to > an EnterpriseDB customer!), the DBA can't tell other people how to fix > it and can't fix it personally either. Instead, the DBA calls me. After reading this statement the ZFS filesystem pops into my mind. It has protection built in against various problems (data degradation, current spikes, phantom writes, etc). For me this raises two questions: 1) would the usage of ZFS prevent such errors? My feeling would say yes, but I have no idea how an invalid infomask bit could occur. 2) would it be possible to add such prevention to PostgreSQL? I know this could add a massive overhead, but if it's optional this could be a fine thing? Greetings, Torsten
On 8/4/16 2:00 AM, Torsten Zuehlsdorff wrote: > > > On 03.08.2016 21:05, Robert Haas wrote: >> On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> I don't think they are saying that logical replication is more >>>> reliable than physical replication, nor do I believe that to be true. >>>> I think they are saying that if logical corruption happens, you can >>>> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the >>>> affected rows, whereas if physical corruption happens, there's no >>>> equally clear path to recovery. >>> >>> Well, that's not an entirely unreasonable point, but I dispute the >>> implication that it makes recovery from corruption an easy thing to do. >>> How are you going to know what SQL statements to issue? If the master >>> database is changing 24x7, how are you going to keep up with that? >> >> I think in many cases people fix their data using business logic. For >> example, suppose your database goes down and you have to run >> pg_resetxlog to get it back up. You dump-and-restore, as one does, >> and find that you can't rebuild one of your unique indexes because >> there are now two records with that same PK. Well, what you do is you >> look at them and judge which one has the correct data, often the one >> that looks more complete or the one with the newer timestamp. Or, >> maybe you need to merge them somehow. In my experience helping users >> through problems of this type, once you explain the problem to the >> user and tell them they have to square it on their end, the support >> call ends. The user may not always be entirely thrilled about having >> to, say, validate a problematic record against external sources of >> truth, but they usually know how to do it. Database bugs aren't the >> only way that databases become inaccurate. If the database that they >> use to keep track of land ownership in the jurisdiction where I live >> says that two different people own the same piece of property, >> somewhere there is a paper deed in a filing cabinet. Fishing that out >> to understand what happened may not be fun, but a DBA can explain that >> problem to other people in the organization and those people can get >> it fixed. It's a problem, but it's fixable. >> >> On the other hand, if a heap tuple contains invalid infomask bits that >> cause an error every time you read the page (this actually happened to >> an EnterpriseDB customer!), the DBA can't tell other people how to fix >> it and can't fix it personally either. Instead, the DBA calls me. > > After reading this statement the ZFS filesystem pops into my mind. It > has protection build in against various problems (data degradation, > current spikes, phantom writes, etc). > > For me this raises two questions: > > 1) would the usage of ZFS prevent such errors? > > My feeling would say yes, but i have no idea about how a invalid > infomask bit could occur. > > 2) would it be possible to add such prevention to PostgreSQL > > I know this could add a massive overhead, but it its optional this > could be a fine thing? Postgresql is very "zfs-like" in its internals. The problem was a bug in postgresql that caused it to just write data to the wrong place. Some vendors use ZFS under databases to provide very cool services such as backup snapshots, test snapshots and other such uses. I think Joyent is one such vendor but I'm not 100% sure. -Alfred
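On the two questions: a checksum (whether ZFS's or PostgreSQL's own) can only detect corruption introduced beneath the database; a bug that makes PostgreSQL itself write a bad but internally consistent page, such as the invalid infomask case above, is checksummed as written and sails straight through. PostgreSQL has had optional page checksums since 9.3; a minimal sketch:

-- Page checksums must be chosen when the cluster is created: initdb --data-checksums
-- After that, every page's checksum is verified as it is read back in, so
-- storage-level corruption is reported instead of being silently returned.
SHOW data_checksums;   -- 'on' if the cluster was initialized with checksums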
On 8/2/16 10:02 PM, Mark Kirkwood wrote: > On 03/08/16 02:27, Robert Haas wrote: >> >> Personally, I think that incremental surgery on our current heap >> format to try to fix this is not going to get very far. If you look >> at the history of this, 8.3 was a huge release for timely cleanup of >> dead tuple. There was also significant progress in 8.4 as a result of >> 5da9da71c44f27ba48fdad08ef263bf70e43e689. As far as I can recall, we >> then made no progress at all in 9.0 - 9.4. We made a very small >> improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but >> that's pretty niche. In 9.6, we have "snapshot too old", which I'd >> argue is potentially a large improvement, but it was big and invasive >> and will no doubt pose code maintenance hazards in the years to come; >> also, many people won't be able to use it or won't realize that they >> should use it. I think it is likely that further incremental >> improvements here will be quite hard to find, and the amount of effort >> will be large relative to the amount of benefit. I think we need a >> new storage format where the bloat is cleanly separated from the data >> rather than intermingled with it; every other major RDMS works that >> way. Perhaps this is a case of "the grass is greener on the other >> side of the fence", but I don't think so. >> > Yeah, I think this is a good summary of the state of play. > > The only other new db development to use a non-overwriting design like > ours that I know of was Jim Starkey's Falcon engine for (ironically) > MySQL 6.0. Not sure if anyone is still progressing that at all now. > > I do wonder if Uber could have successfully tamed dead tuple bloat > with aggressive per-table autovacuum settings (and if in fact they > tried), but as I think Robert said earlier, it is pretty easy to come > up with a high-update (or insert + delete) workload that makes for a > pretty ugly bloat component even with really aggressive autovacuuming. I also wonder if they could have used a "star schema", which to my understanding would mean multiple tables replacing the single table with multiple indexes, to work around the write amplification problem in PostgreSQL. > > Cheers > > Mark > > >
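If that reading is right, the workaround amounts to vertical splitting: move the constantly updated columns into a narrow, lightly indexed table so that the update traffic doesn't have to maintain every index on the wide one. A hypothetical sketch:

-- The wide, heavily indexed reference table stays mostly static...
CREATE TABLE drivers (
    driver_id   bigint PRIMARY KEY,
    name        text,
    city        text,
    rating      numeric
    -- ...plus however many other indexed columns the application needs
);

-- ...while the hot, frequently updated state lives in its own narrow table
-- with only the primary key index, so most of its updates can stay HOT.
CREATE TABLE driver_status (
    driver_id   bigint PRIMARY KEY REFERENCES drivers,
    last_seen   timestamptz,
    lat         double precision,
    lon         double precision
) WITH (fillfactor = 70);   -- leave free space on each page for HOT updates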
On 8/3/16 3:29 AM, Greg Stark wrote: > > Honestly the take-away I see in the Uber story is that they apparently > had nobody on staff that was on -hackers or apparently even -general > and tried to go it alone rather than involve experts from outside > their company. As a result they misdiagnosed their problems based on > prejudices seeing what they expected to see rather than what the real > problem was. > Agree strongly, but there are still lessons to be learned on the PostgreSQL side. -Alfred
Something I didn't see mentioned that I think is a critical point: last I looked, HOT standby (and presumably SR) replays full page writes. That means that *any* kind of corruption on the master is *guaranteed* to replicate to the slave the next time that block is touched. That's completely the opposite of trigger-based replication. On 8/3/16 3:51 PM, Kevin Grittner wrote: > Personally, I can't imagine running logical replication of > supposedly matching sets of data without something equivalent. I think it depends heavily on the replication solution. I ran londiste for 6+ years with no major issues, but of course there was at least one other major company running that. I also took the time to completely read all the source code; something that's a reasonable prospect with a few thousand lines of python. For streaming rep it's difficult just to draw the line at where the code is. Ultimately, people really need to understand the trade-offs to the different solutions so they can make an informed decision on which ones (yes, plural) they want to use. The same can be said about pg_upgrade vs something else, and the different ways of doing backups. Something I think a lot of folks fail to understand is the value of having a system that has simple technology in the mix. Keeping something like londiste running has a non-zero cost, but the day you discover corruption has replicated through your entire infrastructure you'll probably be REALLY happy you have it. Similarly, I always encourage people to run a weekly or monthly pg_dump if it's at all feasible... just to be safe. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On 17 August 2016 at 08:36, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Something I didn't see mentioned that I think is a critical point: last I looked, HOT standby (and presumably SR) replays full page writes.
Yes, that's right, all WAL-based physical replication replays FPWs.
We could, at the cost of increased WAL size, retain both the original WAL buffer that triggered the FPW and the FPW page image. That's what wal_level = logical does in some cases. I'm not sure it's that compelling though, it just introduces another redo path that can go wrong.
> Ultimately, people really need to understand the trade-offs to the different solutions so they can make an informed decision on which ones (yes, plural) they want to use. The same can be said about pg_upgrade vs something else, and the different ways of doing backups.
Right.
It's really bugging me that people are talking about "statement based" replication in MySQL as if it's just sending SQL text around. MySQL's statement-based replication is a lot smarter than that, and in the actually-works-properly form it's a hybrid of row and statement based replication ("MIXED" mode). As I understand it, it lobs around something closer to parsetrees with some values pre-computed rather than SQL text where possible. It stores some computed values of volatile functions in the binlog and reads them from there rather than computing them again when running the statement on replicas, which is why AUTO_INCREMENT etc works. It also falls back to row based replication where necessary for correctness. Even then it has a significant list of caveats, but it's pretty damn impressive. I didn't realise how clever the hybrid system was until recently.
I can see it being desirable to do something like that eventually as an optimisation to logical decoding based replication. Where we can show that the statement is safe or make it safe by doing things like evaluating and substituting volatile function calls, xlog a modified parsetree with oids changed to qualified object names etc, send that when decoding, and execute that on the downstream(s). If there's something we can't show to be safe then replay the logical rows instead. That's way down the track though; I think it's more important to focus on completing logical row-based replication to the point where we handle table rewrites seamlessly and it "just works" first.
On Wed, Aug 17, 2016 at 1:36 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > Something I didn't see mentioned that I think is a critical point: last I > looked, HOT standby (and presumably SR) replays full page writes. That means > that *any* kind of corruption on the master is *guaranteed* to replicate to > the slave the next time that block is touched. That's completely the > opposite of trigger-based replication. Yes, this is exactly what it should be doing and exactly why it's useful. Physical replication accurately replicates the data from the master including "corruption" whereas a logical replication system will not, causing divergence and possible issues during a failover. Picture yourself as Delta, you have a fire in your data centre and go to fail over to your secondary site. Your DBAs inform you that the secondary site has "fixed" some corruption that you were unaware of and wasn't causing any issues and now, in the middle of the business crisis, is when you're going to need to spend time identifying and repairing the problem because your business logic has suddenly started running into problems. Physical replication tries to solve the same use cases as physical backups. They both provide you with exactly what you had prior to the recovery. No more or less. That's what you want when recovering from a disaster. -- greg
On Wed, Aug 17, 2016 at 01:27:18PM +0800, Craig Ringer wrote: > It's really bugging me that people are talking about "statement based" > replication in MySQL as if it's just sending SQL text around. MySQL's statemnet > based replication is a lot smarter than that, and in the > actually-works-properly form it's a hybrid of row and statement based > replication ("MIXED" mode). As I understand it it lobs around something closer > to parsetrees with some values pre-computed rather than SQL text where > possible. It stores some computed values of volatile functions in the binlog > and reads them from there rather than computing them again when running the > statement on replicas, which is why AUTO_INCREMENT etc works. It also falls > back to row based replication where necessary for correctness. Even then it has > a significant list of caveats, but it's pretty damn impressive. I didn't > realise how clever the hybrid system was until recently. > > I can see it being desirable to do something like that eventually as an > optimisation to logical decoding based replication. Where we can show that the > statement is safe or make it safe by doing things like evaluating and > substituting volatile function calls, xlog a modified parsetree with oids > changed to qualified object names etc, send that when decoding, and execute > that on the downstream(s). If there's something we can't show to be safe then > replay the logical rows instead. That's way down the track though; I think it's > more important to focus on completing logical row-based replication to the > point where we handle table rewrites seamlessly and it "just works" first. That was very interesting, and good to know. I assume it also covers concurrent activity issues which I wrote about in this thread, e.g. > I saw from the Uber article that they weren't going to per-row logical > replication but _statement_ replication, which is very hard to do > because typical SQL doesn't record what concurrent transactions > committed before a new statement's transaction snapshot is taken, and > doesn't record lock order for row updates blocked by concurrent activity > --- both of which affect the final result from the query. I assume they can do SQL-level replication when there is no other concurrent activity on the table, and row-based in other cases? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 17 August 2016 at 21:35, Bruce Momjian <bruce@momjian.us> wrote:
> > I saw from the Uber article that they weren't going to per-row logical
> > replication but _statement_ replication, which is very hard to do
> > because typical SQL doesn't record what concurrent transactions
> > committed before a new statement's transaction snapshot is taken, and
> > doesn't record lock order for row updates blocked by concurrent activity
> > --- both of which affect the final result from the query.
> I assume they can do SQL-level replication when there is no other
> concurrent activity on the table, and row-based in other cases?
I don't know, but wouldn't want to assume that. A quick search suggests they probably define that away as nondeterministic behaviour that's allowed to cause master/replica differences, but no time to look deeply.
On 17 August 2016 at 12:19, Greg Stark <stark@mit.edu> wrote: > On Wed, Aug 17, 2016 at 1:36 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >> Something I didn't see mentioned that I think is a critical point: last I >> looked, HOT standby (and presumably SR) replays full page writes. That means >> that *any* kind of corruption on the master is *guaranteed* to replicate to >> the slave the next time that block is touched. That's completely the >> opposite of trigger-based replication. > > Yes, this is exactly what it should be doing and exactly why it's > useful. Physical replication accurately replicates the data from the > master including "corruption" whereas a logical replication system > will not, causing divergence and possible issues during a failover. Yay! Completely agree. Physical replication, as used by DRBD and all other block-level HA solutions, and also used by other databases, such as Oracle. Corruption on the master would often cause errors that would prevent writes and therefore those changes wouldn't even be made, let alone be replicated. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8/17/16 2:51 PM, Simon Riggs wrote: > On 17 August 2016 at 12:19, Greg Stark <stark@mit.edu> wrote: >> On Wed, Aug 17, 2016 at 1:36 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: >>> Something I didn't see mentioned that I think is a critical point: last I >>> looked, HOT standby (and presumably SR) replays full page writes. That means >>> that *any* kind of corruption on the master is *guaranteed* to replicate to >>> the slave the next time that block is touched. That's completely the >>> opposite of trigger-based replication. >> >> Yes, this is exactly what it should be doing and exactly why it's >> useful. Physical replication accurately replicates the data from the >> master including "corruption" whereas a logical replication system >> will not, causing divergence and possible issues during a failover. > > Yay! Completely agree. > > Physical replication, as used by DRBD and all other block-level HA > solutions, and also used by other databases, such as Oracle. > > Corruption on the master would often cause errors that would prevent > writes and therefore those changes wouldn't even be made, let alone be > replicated. My experience has been that you discover corruption after it's already safely on disk, and more than once I've been able to recover by using data on a londiste replica. As I said originally, it's critical to understand the different solutions and the pros and cons of each. There is no magic bullet. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Wed, Aug 17, 2016 at 5:18 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 8/17/16 2:51 PM, Simon Riggs wrote: >> On 17 August 2016 at 12:19, Greg Stark <stark@mit.edu> wrote: >>> Yes, this is exactly what it should be doing and exactly why it's >>> useful. Physical replication accurately replicates the data from the >>> master including "corruption" whereas a logical replication system >>> will not, causing divergence and possible issues during a failover. >> >> >> Yay! Completely agree. >> >> Physical replication, as used by DRBD and all other block-level HA >> solutions, and also used by other databases, such as Oracle. >> >> Corruption on the master would often cause errors that would prevent >> writes and therefore those changes wouldn't even be made, let alone be >> replicated. > > > My experience has been that you discover corruption after it's already > safely on disk, and more than once I've been able to recover by using data > on a londiste replica. > > As I said originally, it's critical to understand the different solutions > and the pros and cons of each. There is no magic bullet. Data point: in the half dozen or so cases I've experienced corruption on replicated systems, in all cases but one the standby was clean. The 'unclean' case was actually an 8.2 warm standby; the source of the corruption was a very significant bug where prepared statements would write back corrupted data if the table definitions changed under the statement (fixed in 8.3). In that particular case the corruption was very unfortunately quite widespread and passed directly along to the standby server. This bug nearly cost us a user as well although not nearly so famous as uber :-). In the few modern cases I've seen I've not been able to trace it back to any bug in postgres (in particular multixact was ruled out) and I've chalked it up to media or (more likely I think) filesystem problems in the face of a -9 reset. merlin