Thread: Why we lost Uber as a user

Why we lost Uber as a user

From
"Joshua D. Drake"
Date:
Hello,

The following article is a very good look at some of our limitations and 
highlights some of the pains many of us have been working "around" since 
we started using the software.

https://eng.uber.com/mysql-migration/

Specifically:

* Inefficient architecture for writes
* Inefficient data replication
* Issues with table corruption
* Poor replica MVCC support
* Difficulty upgrading to newer releases

It is a very good read and I encourage our hackers to do so with an open 
mind.

Sincerely,

JD

-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.



Re: Why we lost Uber as a user

From
Josh Berkus
Date:
On 07/26/2016 09:54 AM, Joshua D. Drake wrote:
> Hello,
>
> The following article is a very good look at some of our limitations and
> highlights some of the pains many of us have been working "around" since
> we started using the software.

They also had other reasons to switch to MySQL, particularly around
changes of staffing (the switch happened after they got a new CTO).  And
they encountered that 9.2 bug literally the week we released a fix, per
one of the mailing lists. Even if they switched off, it's still a nice
testimonial that they once ran their entire worldwide fleet off a single
Postgres cluster.

However, the issues they cite as limitations of our current replication
system are real, or we wouldn't have so many people working on
alternatives.  We could really use pglogical in 10.0, as well as
OLTP-friendly MM replication.

The write amplification issue, and its corollary in VACUUM, certainly
continues to plague some users, and doesn't have any easy solutions.

I do find it interesting that they mention schema changes in passing,
without actually saying anything about them -- given that schema changes
have been one of MySQL's major limitations.  I'll also note that they
don't mention any of MySQL's corresponding weak spots, such as
limitations on table size due to primary key sorting.

One wonders what would have happened if they'd adopted a sharding model
on top of Postgres?

I would like to see someone blog about our testing for replication
corruption issues now, in response to this.

--
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)



Re: Why we lost Uber as a user

From
Josh Berkus
Date:
On 07/26/2016 01:53 PM, Josh Berkus wrote:
> The write amplification issue, and its corollary in VACUUM, certainly
> continues to plague some users, and doesn't have any easy solutions.

To explain this in concrete terms, which the blog post does not:

1. Create a small table, but one with enough rows that indexes make
sense (say 50,000 rows).

2. Make this table used in JOINs all over your database.

3. To support these JOINs, index most of the columns in the small table.

4. Now, update that small table 500 times per second.

That's a recipe for runaway table bloat; VACUUM can't do much because
there's always some minutes-old transaction hanging around (and SNAPSHOT
TOO OLD doesn't really help, we're talking about minutes here), and
because of all of the indexes HOT isn't effective.  Removing the indexes
is equally painful because it means less efficient JOINs.
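A minimal sketch of that setup, with invented names:

    CREATE TABLE lookup (
        id    int PRIMARY KEY,
        col1  int,
        col2  int,
        col3  text
    );
    INSERT INTO lookup
    SELECT i, i % 100, i % 1000, md5(i::text)
    FROM generate_series(1, 50000) AS i;

    -- index most columns to support JOINs elsewhere; any update that
    -- touches an indexed column then disqualifies the row from HOT
    CREATE INDEX ON lookup (col1);
    CREATE INDEX ON lookup (col2);
    CREATE INDEX ON lookup (col3);

    -- repeat ~500 times per second: each update leaves behind a dead
    -- row version in the heap plus entries in all four indexes that
    -- VACUUM can't reclaim while older snapshots exist
    UPDATE lookup SET col1 = col1 + 1 WHERE id = 42;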

The Uber guy is right that InnoDB handles this better as long as you
don't touch the primary key (primary key updates in InnoDB are really bad).

This is a common problem case we don't have an answer for yet.

--
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)



Re: Why we lost Uber as a user

From
Bruce Momjian
Date:
On Tue, Jul 26, 2016 at 02:26:57PM -0700, Josh Berkus wrote:
> On 07/26/2016 01:53 PM, Josh Berkus wrote:
> > The write amplification issue, and its corollary in VACUUM, certainly
> > continues to plague some users, and doesn't have any easy solutions.
> 
> To explain this in concrete terms, which the blog post does not:
> 
> 1. Create a small table, but one with enough rows that indexes make
> sense (say 50,000 rows).
> 
> 2. Make this table used in JOINs all over your database.
> 
> 3. To support these JOINs, index most of the columns in the small table.
> 
> 4. Now, update that small table 500 times per second.
> 
> That's a recipe for runaway table bloat; VACUUM can't do much because
> there's always some minutes-old transaction hanging around (and SNAPSHOT
> TOO OLD doesn't really help, we're talking about minutes here), and
> because of all of the indexes HOT isn't effective.  Removing the indexes
> is equally painful because it means less efficient JOINs.
> 
> The Uber guy is right that InnoDB handles this better as long as you
> don't touch the primary key (primary key updates in InnoDB are really bad).
> 
> This is a common problem case we don't have an answer for yet.

Or, basically, we don't have an answer to without making something else
worse.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Why we lost Uber as a user

From
Robert Haas
Date:
On Tue, Jul 26, 2016 at 5:26 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/26/2016 01:53 PM, Josh Berkus wrote:
>> The write amplification issue, and its corollary in VACUUM, certainly
>> continues to plague some users, and doesn't have any easy solutions.
>
> To explain this in concrete terms, which the blog post does not:
>
> 1. Create a small table, but one with enough rows that indexes make
> sense (say 50,000 rows).
>
> 2. Make this table used in JOINs all over your database.
>
> 3. To support these JOINs, index most of the columns in the small table.
>
> 4. Now, update that small table 500 times per second.
>
> That's a recipe for runaway table bloat; VACUUM can't do much because
> there's always some minutes-old transaction hanging around (and SNAPSHOT
> TOO OLD doesn't really help, we're talking about minutes here), and
> because of all of the indexes HOT isn't effective.  Removing the indexes
> is equally painful because it means less efficient JOINs.
>
> The Uber guy is right that InnoDB handles this better as long as you
> don't touch the primary key (primary key updates in InnoDB are really bad).
>
> This is a common problem case we don't have an answer for yet.

This is why I think we need a pluggable heap storage layer, which
could be done either by rebranding foreign data wrappers as data
wrappers (as I have previously proposed) or using the access method
interface (as proposed by Alexander Korotkov at PGCon).  We're
reaching the limits of what can be done using our current heap format,
and we need to enable developers to experiment with new things.  Aside
from the possibility of eventually coming up with something that's
good enough to completely (or mostly) replace our current heap storage
format, we need to support specialized data storage formats that are
optimized for particular use cases (columnar, memory-optimized, WORM).
I know that people are worried about ending up with too many heap
storage formats, but I think we should be a lot more worried about not
having enough heap storage formats.  Anybody who thinks that the
current design is working for all of our users is not paying very
close attention.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why we lost Uber as a user

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> To explain this in concrete terms, which the blog post does not:

> 1. Create a small table, but one with enough rows that indexes make
> sense (say 50,000 rows).

> 2. Make this table used in JOINs all over your database.

> 3. To support these JOINs, index most of the columns in the small table.

> 4. Now, update that small table 500 times per second.

> That's a recipe for runaway table bloat; VACUUM can't do much because
> there's always some minutes-old transaction hanging around (and SNAPSHOT
> TOO OLD doesn't really help, we're talking about minutes here), and
> because of all of the indexes HOT isn't effective.

Hm, I'm not following why this is a disaster.  OK, you have circa 100%
turnover of the table in the lifespan of the slower transactions, but I'd
still expect vacuuming to be able to hold the bloat to some small integer
multiple of the minimum possible table size.  (And if the table is small,
that's still small.)  I suppose really long transactions (pg_dump?) could
be pretty disastrous, but there are ways around that, like doing pg_dump
on a slave.

Or in short, this seems like an annoyance, not a time-for-a-new-database
kind of problem.
        regards, tom lane



Re: Why we lost Uber as a user

From
Josh Berkus
Date:
On 07/26/2016 03:07 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:

>> That's a recipe for runaway table bloat; VACUUM can't do much because
>> there's always some minutes-old transaction hanging around (and SNAPSHOT
>> TOO OLD doesn't really help, we're talking about minutes here), and
>> because of all of the indexes HOT isn't effective.
>
> Hm, I'm not following why this is a disaster.  OK, you have circa 100%
> turnover of the table in the lifespan of the slower transactions, but I'd
> still expect vacuuming to be able to hold the bloat to some small integer
> multiple of the minimum possible table size.

Not in practice.  Don't forget that the indexes bloat as well.  I
encountered this particular failure case multiple times, and often bloat
ended up at something like 100X the clean table/index size, with no
stable size (that is, it always kept growing).  This was the original
impetus for wanting REINDEX CONCURRENTLY, but really that's kind of a
workaround.
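The stopgap at the time was the concurrent rebuild-and-swap, roughly
(index and table names invented; this doesn't work for indexes backing
UNIQUE or PRIMARY KEY constraints):

    CREATE INDEX CONCURRENTLY lookup_col1_idx_new ON lookup (col1);
    DROP INDEX CONCURRENTLY lookup_col1_idx;    -- drop the bloated index
    ALTER INDEX lookup_col1_idx_new RENAME TO lookup_col1_idx;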
> (And if the table is small,
> that's still small.)  I suppose really long transactions (pg_dump?) could
> be pretty disastrous, but there are ways around that, like doing pg_dump
> on a slave.

You'd need a dedicated slave for the pg_dump, otherwise you'd hit query
cancel.
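The usual knob for such a dedicated standby (a sketch; set on that
standby only, and note that replay simply falls behind while the dump
runs):

    ALTER SYSTEM SET max_standby_streaming_delay = -1;  -- never cancel
    SELECT pg_reload_conf();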

> Or in short, this seems like an annoyance, not a time-for-a-new-database
> kind of problem.

It's considerably more than an annoyance for the people who suffer from
it; for some databases I dealt with, this one issue was responsible for
80% of administrative overhead (cron jobs, reindexing, timeouts ...).

But no, it's not a database-switcher *by itself*.  But it is a chronic,
and serious, problem.  I don't have even a suggestion of a real solution
for it without breaking something else, though.

--
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)



Re: Why we lost Uber as a user

From
Robert Haas
Date:
On Tue, Jul 26, 2016 at 6:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> To explain this in concrete terms, which the blog post does not:
>
>> 1. Create a small table, but one with enough rows that indexes make
>> sense (say 50,000 rows).
>
>> 2. Make this table used in JOINs all over your database.
>
>> 3. To support these JOINs, index most of the columns in the small table.
>
>> 4. Now, update that small table 500 times per second.
>
>> That's a recipe for runaway table bloat; VACUUM can't do much because
>> there's always some minutes-old transaction hanging around (and SNAPSHOT
>> TOO OLD doesn't really help, we're talking about minutes here), and
>> because of all of the indexes HOT isn't effective.
>
> Hm, I'm not following why this is a disaster.  OK, you have circa 100%
> turnover of the table in the lifespan of the slower transactions, but I'd
> still expect vacuuming to be able to hold the bloat to some small integer
> multiple of the minimum possible table size.  (And if the table is small,
> that's still small.)  I suppose really long transactions (pg_dump?) could
> be pretty disastrous, but there are ways around that, like doing pg_dump
> on a slave.
>
> Or in short, this seems like an annoyance, not a time-for-a-new-database
> kind of problem.

I've seen multiple cases where this kind of thing causes a
sufficiently large performance regression that the system just can't
keep up.  Things are OK when the table is freshly-loaded, but as soon
as somebody runs a query on any table in the cluster that lasts for a
minute or two, so much bloat accumulates that the performance drops to
an unacceptable level.  This kind of thing certainly doesn't happen to
everybody, but equally certainly, this isn't the first time I've heard
of it being a problem.  Sometimes, with careful tending and a very
aggressive autovacuum configuration, you can live with it, but it's
never a lot of fun.
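For what it's worth, "very aggressive" here usually means per-table
storage parameters along these lines (table name invented; the exact
numbers are workload-dependent):

    ALTER TABLE lookup SET (
        autovacuum_vacuum_scale_factor = 0.0,  -- trigger on row count alone
        autovacuum_vacuum_threshold    = 1000, -- vacuum after ~1000 dead rows
        autovacuum_vacuum_cost_delay   = 0     -- don't throttle the worker
    );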

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why we lost Uber as a user

From
Michael Paquier
Date:
On Wed, Jul 27, 2016 at 7:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've seen multiple cases where this kind of thing causes a
> sufficiently large performance regression that the system just can't
> keep up.  Things are OK when the table is freshly-loaded, but as soon
> as somebody runs a query on any table in the cluster that lasts for a
> minute or two, so much bloat accumulates that the performance drops to
> an unacceptable level.  This kind of thing certainly doesn't happen to
> everybody, but equally certainly, this isn't the first time I've heard
> of it being a problem.  Sometimes, with careful tending and a very
> aggressive autovacuum configuration, you can live with it, but it's
> never a lot of fun.

Yes... that's not fun at all.  And it can take days to do this tuning
properly when testing a product that is supposed to work as its spec
promises, without extra effort from the customer.

As much as this post is interesting, the comments on HN are a good read as well:
https://news.ycombinator.com/item?id=12166585
Some points raised are that the "flaws" mentioned in this post are
actually advantages. But I guess this depends on how you want to run
your business via your application layer.
-- 
Michael



Re: Why we lost Uber as a user

From
Stephen Frost
Date:
* Joshua D. Drake (jd@commandprompt.com) wrote:
> Hello,
>
> The following article is a very good look at some of our limitations
> and highlights some of the pains many of us have been working
> "around" since we started using the software.
>
> https://eng.uber.com/mysql-migration/
>
> Specifically:
>
> * Inefficient architecture for writes
> * Inefficient data replication

The above are related and there are serious downsides to having an extra
mapping in the middle between the indexes and the heap.

What makes me doubt just how well they understood the issues or what is
happening is the lack of any mention of hint bits or tuple freezing
(both requiring additional writes).

> * Issues with table corruption

That was a bug that was fixed quite quickly once it was detected.  The
implication that MySQL doesn't have similar bugs is entirely incorrect,
as is the idea that logical replication would avoid data corruption
issues (in practice, it actually tends to be quite a bit worse).

> * Poor replica MVCC support

Solved through the hot standby feedback system.
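That is, roughly (a sketch; the trade-off is that the master then holds
back cleanup of dead rows still needed by standby queries):

    -- on the standby:
    ALTER SYSTEM SET hot_standby_feedback = on;
    SELECT pg_reload_conf();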

> * Difficulty upgrading to newer releases

Their specific issue with these upgrades was solved, years ago, by me
(and it wasn't particularly difficult to do...) through the use of
pg_upgrade's --link option and rsync's ability to construct hard link
trees.  Making major release upgrades easier with less downtime is
certainly a good goal, but there's been a solution to the specific issue
they had here for quite a while.

Thanks!

Stephen

Re: Why we lost Uber as a user

From
Robert Haas
Date:
On Tue, Jul 26, 2016 at 8:27 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Joshua D. Drake (jd@commandprompt.com) wrote:
>> Hello,
>>
>> The following article is a very good look at some of our limitations
>> and highlights some of the pains many of us have been working
>> "around" since we started using the software.
>>
>> https://eng.uber.com/mysql-migration/
>>
>> Specifically:
>>
>> * Inefficient architecture for writes
>> * Inefficient data replication
>
> The above are related and there are serious downsides to having an extra
> mapping in the middle between the indexes and the heap.
>
> What makes me doubt just how well they understood the issues or what is
> happening is the lack of any mention of hint bits or tuple freezing
> (both requiring additional writes).

Yeah.  A surprising amount of that post seemed to be devoted to
describing how our MVCC architecture works rather than what problem
they had with it.  I'm not saying we shouldn't take their bad
experience seriously - we clearly should - but I don't feel like it's
as clear as it could be about exactly where the breakdowns happened.
That's why I found Josh's restatement useful - I am assuming without
proof that his restatement is accurate....

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why we lost Uber as a user

From
Vik Fearing
Date:
On 27/07/16 05:45, Robert Haas wrote:
> On Tue, Jul 26, 2016 at 8:27 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> * Joshua D. Drake (jd@commandprompt.com) wrote:
>>> Hello,
>>>
>>> The following article is a very good look at some of our limitations
>>> and highlights some of the pains many of us have been working
>>> "around" since we started using the software.
>>>
>>> https://eng.uber.com/mysql-migration/
>>>
>>> Specifically:
>>>
>>> * Inefficient architecture for writes
>>> * Inefficient data replication
>>
>> The above are related and there are serious downsides to having an extra
>> mapping in the middle between the indexes and the heap.
>>
>> What makes me doubt just how well they understood the issues or what is
>> happening is the lack of any mention of hint bits or tuple freezing
>> (both requiring additional writes).
> 
> Yeah.  A surprising amount of that post seemed to be devoted to
> describing how our MVCC architecture works rather than what problem
> they had with it.  I'm not saying we shouldn't take their bad
> experience seriously - we clearly should - but I don't feel like it's
> as clear as it could be about exactly where the breakdowns happened.

There is some more detailed information in this 30-minute talk:
https://vimeo.com/145842299
-- 
Vik Fearing                                          +33 6 46 75 15 36
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support



Re: Why we lost Uber as a user

From
Merlin Moncure
Date:
On Tue, Jul 26, 2016 at 5:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> To explain this in concrete terms, which the blog post does not:
>
>> 1. Create a small table, but one with enough rows that indexes make
>> sense (say 50,000 rows).
>
>> 2. Make this table used in JOINs all over your database.
>
>> 3. To support these JOINs, index most of the columns in the small table.
>
>> 4. Now, update that small table 500 times per second.
>
>> That's a recipe for runaway table bloat; VACUUM can't do much because
>> there's always some minutes-old transaction hanging around (and SNAPSHOT
>> TOO OLD doesn't really help, we're talking about minutes here), and
>> because of all of the indexes HOT isn't effective.
>
> Hm, I'm not following why this is a disaster.  OK, you have circa 100%
> turnover of the table in the lifespan of the slower transactions, but I'd
> still expect vacuuming to be able to hold the bloat to some small integer
> multiple of the minimum possible table size.  (And if the table is small,
> that's still small.)  I suppose really long transactions (pg_dump?) could
> be pretty disastrous, but there are ways around that, like doing pg_dump
> on a slave.
>
> Or in short, this seems like an annoyance, not a time-for-a-new-database
> kind of problem.

Well, the real annoyance as I understand it is the raw volume of bytes
of WAL traffic a single update of a field can cause.  They switched to
statement level replication(!).

merlin



Re: Why we lost Uber as a user

From
Bruce Momjian
Date:
On Wed, Jul 27, 2016 at 08:33:52AM -0500, Merlin Moncure wrote:
> > Or in short, this seems like an annoyance, not a time-for-a-new-database
> > kind of problem.
> 
> Well, the real annoyance as I understand it is the raw volume of bytes
> of WAL traffic a single update of a field can cause.  They switched to
> statement level replication(!).

Well, their big complaint about binary replication is that a bug can
spread from a master to all slaves, which doesn't happen with statement
level replication.  If that type of corruption is your primary worry,
and you can ignore the worries about statement level replication, then
it makes sense.  Of course, the big tragedy is that statement level
replication has known unfixable(?) failures, while binary replication
failures are caused by developer-introduced bugs.
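As a concrete illustration of the unfixable class (job_queue is an
invented table): with no ORDER BY, each server may legitimately pick
different rows, so the replicas silently diverge.

    DELETE FROM job_queue
    WHERE id IN (SELECT id FROM job_queue LIMIT 10);
    -- no ORDER BY: the master and each replica can each delete a
    -- different set of ten rows when re-executing this statement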

In some ways, people worry about the bugs they have seen, not the bugs
they haven't seen.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Why we lost Uber as a user

From
Josh Berkus
Date:
On 07/26/2016 08:45 PM, Robert Haas wrote:
> That's why I found Josh's restatement useful - I am assuming without
> proof that his restatement is accurate....

FWIW, my restatement was based on some other sites rather than Uber.
Including folks who didn't abandon Postgres.

-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)



Re: Why we lost Uber as a user

From
Geoff Winkless
Date:
On 27 July 2016 at 17:04, Bruce Momjian <bruce@momjian.us> wrote:
> Well, their big complaint about binary replication is that a bug can
> spread from a master to all slaves, which doesn't happen with statement
> level replication.

I'm not sure that that makes sense to me. If there's a database bug
that occurs when you run a statement on the master, it seems there's a
decent chance that that same bug is going to occur when you run the
same statement on the slave.

Obviously it depends on the type of bug and how identical the slave is,
but statement-level replication certainly doesn't preclude such a bug
from propagating.

Geoff

Re: Why we lost Uber as a user

From
Vitaly Burovoy
Date:
On 7/28/16, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> On 27 July 2016 at 17:04, Bruce Momjian <bruce@momjian.us> wrote:
>
>> Well, their big complaint about binary replication is that a bug can
>> spread from a master to all slaves, which doesn't happen with statement
>> level replication.
>
> I'm not sure that that makes sense to me. If there's a database bug that
> occurs when you run a statement on the master, it seems there's a decent
> chance that that same bug is going to occur when you run the same statement
> on the slave.
>
> Obviously it depends on the type of bug and how identical the slave is, but
> statement-level replication certainly doesn't preclude such a bug from
> propagating.
>
> Geoff

Please, read the article first! The bug is about wrong visibility of
tuples after applying WAL at slaves.
For example, you can see two different records selecting from a table
by a primary key (moreover, their PKs are the same, but other columns
differ).

From the article (emphasis mine):
The following query illustrates how this bug would affect our users
table example:
SELECT * FROM users WHERE id = 4;
This query would return *TWO* records: ...


And it affected slaves, not the master.
Slaves exist to reduce load on the master; if you ran all queries (even
read-only ones) on the master, why would you (or anyone) have so many
slaves?

--
Best regards,
Vitaly Burovoy



Re: Why we lost Uber as a user

From
Geoff Winkless
Date:
On 28 Jul 2016 12:19, "Vitaly Burovoy" <vitaly.burovoy@gmail.com> wrote:
> On 7/28/16, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> > On 27 July 2016 at 17:04, Bruce Momjian <bruce@momjian.us> wrote:
> >> Well, their big complaint about binary replication is that a bug can
> >> spread from a master to all slaves, which doesn't happen with statement
> >> level replication.
> >
> > I'm not sure that that makes sense to me. If there's a database bug that
> > occurs when you run a statement on the master, it seems there's a decent
> > chance that that same bug is going to occur when you run the same
> > statement on the slave.
> >
> > Obviously it depends on the type of bug and how identical the slave is,
> > but statement-level replication certainly doesn't preclude such a bug
> > from propagating.
>
> Please, read the article first! The bug is about wrong visibility of
> tuples after applying WAL at slaves.
> For example, you can see two different records selecting from a table
> by a primary key (moreover, their PKs are the same, but other columns
> differ).

I read the article. It affected slaves as well as the master.

I quote:
"because of the way replication works, this issue has the potential to
spread into all of the databases in a replication hierarchy"

I maintain that this is a nonsense argument. Especially since (as you
pointed out and as I missed first time around) the bug actually occurred
at different records on different slaves, so he invalidates his own point.

Geoff

Re: Why we lost Uber as a user

From
pgwhatever
Date:
Statement-Based replication has a lot of problems with it like indeterminate
UDFs.  Here is a link to see them all:
https://dev.mysql.com/doc/refman/5.7/en/replication-sbr-rbr.html#replication-sbr-rbr-sbr-disadvantages






Re: Why we lost Uber as a user

From
Merlin Moncure
Date:
On Thu, Jul 28, 2016 at 8:16 AM, pgwhatever <michael@sqlexec.com> wrote:
> Statement-Based replication has a lot of problems with it like indeterminate
> UDFs.  Here is a link to see them all:
> https://dev.mysql.com/doc/refman/5.7/en/replication-sbr-rbr.html#replication-sbr-rbr-sbr-disadvantages

Sure.  It's also incredibly efficient with respect to bandwidth -- so,
if your application was engineered to work around those problems
it's a huge win.  They could have used pgpool, but I guess the fix was
already in.

Taking a step back, from the outside, it looks like uber:
*) has a very thick middleware, very thin database with respect to
logic and complexity
*) has a very high priority on quick and cheap (in terms of bandwidth)
replication
*) has decided the database needs to be interchangeable
*) is not afraid to make weak or erroneous technical justifications as
a basis of stack selection (the futex vs ipc argument I felt was
particularly awful -- it ignored the fact we use spinlocks)

The very fact that they swapped it out so easily suggests that they
were not utilizing the database as they could have, and a different
technical team might have come to a different result.   Postgres is a
very general system and rewards deep knowledge such that it can
outperform even specialty systems in the hands of a capable developer
(for example, myself).  I'm just now hammering in the final coffin
nails that will get solr swapped out for jsonb backed postgres.

I guess it's fair to say that they felt mysql is closer to what they
felt a database should do out of the box.  That's disappointing, but
life moves on.  The takeaways are:

*) people like different choices of replication mechanics -- statement
level sucks a lot of the time, but not all the time
*) hs/sr simplicity of configuration and operation is a big issue.
it's continually gotten better and still needs to improve
*) bad QC can cost you customers.   how much regression coverage do we
have of hs/sr?
*) postgres may not be the ideal choice for those who want a thin and
simple database

merlin



Re: Why we lost Uber as a user

From
Vladimir Sitnikov
Date:

>> That's a recipe for runaway table bloat; VACUUM can't do much because
>> there's always some minutes-old transaction hanging around (and SNAPSHOT
>> TOO OLD doesn't really help, we're talking about minutes here), and
>> because of all of the indexes HOT isn't effective.

Just curious: what if PostgreSQL supported an index that stores the
primary key (or a unique key) instead of tids?
Am I right that such an index would not suffer from that bloat?  I'm
assuming the primary key is not updated, so secondary indexes built that
way should be much less prone to bloat when updates land on other
columns (even if the tid moves, the row's PK does not change, so the
secondary index entry could be reused).

If that works, it could reduce index bloat and reduce the amount of WAL
(fewer indexes would need to be updated).  Of course it would make index
scans somewhat more expensive, but it looks like Uber, at least, is fine
with that extra cost.

Does it make sense to implement that kind of index as an access method?
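To illustrate with a hypothetical users(id, email) table: today the
secondary index maps email straight to a ctid; under this proposal a
lookup through it would logically become two index descents, as if one
had written:

    -- assuming email is unique: the secondary index yields the PK,
    -- and the PK index then yields the heap tuple
    SELECT *
    FROM users
    WHERE id = (SELECT id FROM users WHERE email = 'someone@example.com');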

Vladimir

Re: Why we lost Uber as a user

From
Alex Ignatov
Date:

On 28.07.2016 17:53, Vladimir Sitnikov wrote:

> >> That's a recipe for runaway table bloat; VACUUM can't do much because
> >> there's always some minutes-old transaction hanging around (and SNAPSHOT
> >> TOO OLD doesn't really help, we're talking about minutes here), and
> >> because of all of the indexes HOT isn't effective.
>
> Just curious: what if PostgreSQL supported an index that stores the
> primary key (or a unique key) instead of tids?
> Am I right that such an index would not suffer from that bloat?  I'm
> assuming the primary key is not updated, so secondary indexes built that
> way should be much less prone to bloat when updates land on other
> columns (even if the tid moves, the row's PK does not change, so the
> secondary index entry could be reused).
>
> If that works, it could reduce index bloat and reduce the amount of WAL
> (fewer indexes would need to be updated).  Of course it would make index
> scans somewhat more expensive, but it looks like Uber, at least, is fine
> with that extra cost.
>
> Does it make sense to implement that kind of index as an access method?
>
> Vladimir

You mean IOT, like Oracle has?

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Why we lost Uber as a user

From
Josh Berkus
Date:
On 07/28/2016 03:58 AM, Geoff Winkless wrote:
> On 27 July 2016 at 17:04, Bruce Momjian <bruce@momjian.us> wrote:
>
>     Well, their big complaint about binary replication is that a bug can
>     spread from a master to all slaves, which doesn't happen with statement
>     level replication.
>
> I'm not sure that that makes sense to me. If there's a database bug
> that occurs when you run a statement on the master, it seems there's a
> decent chance that that same bug is going to occur when you run the same
> statement on the slave.
>
> Obviously it depends on the type of bug and how identical the slave is,
> but statement-level replication certainly doesn't preclude such a bug
> from propagating.

That's correct, which is why I ignored that part of their post.

However, we did have issues for a couple of years where replication
accuracy was poorly tested, and did have several bugs associated with
that.  It's not surprising that a few major users got hit hard by those
bugs and decided to switch.

--
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)



Re: Why we lost Uber as a user

From
Jim Nasby
Date:
On 7/28/16 10:05 AM, Alex Ignatov wrote:
>> Just curious: what if PostgreSQL supported an index that stores the
>> primary key (or a unique key) instead of tids?
>
> You mean IOT, like Oracle has?

IIRC, IOT stores the table in index order, which is something
different.

What Vladimir is proposing is an index method that stores a datum instead of
a ctid. You would then use that datum to probe a different index to get 
the ctid. Or put simply, you have a PK index that contains ctid's, and a 
bunch of other indexes that contain a PK value instead of ctid's.

I think it's an idea worth pursuing, but I don't see how you can make it 
work with our MVCC system unless we drop the aversion to scanning back 
into an index as part of an update.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: Why we lost Uber as a user

From
Stephen Frost
Date:
* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
> On 7/28/16 10:05 AM, Alex Ignatov wrote:
> >> Just curious: what if PostgreSQL supported an index that stores the
> >> primary key (or a unique key) instead of tids?
> >
> > You mean IOT, like Oracle has?
>
> IIRC, IOT stores the table in index order, which is something
> different.

IOT is definitely an interesting idea that I'd like to see us pursue,
but I agree that it's something different.

> What Vladimir is proposing is an index method that stores a datum
> instead of a ctid. You would then use that datum to probe a
> different index to get the ctid. Or put simply, you have a PK index
> that contains ctid's, and a bunch of other indexes that contain a PK
> value instead of ctid's.

Right, that's the MySQL approach, which has advantages and
disadvantages.

> I think it's an idea worth pursuing, but I don't see how you can
> make it work with our MVCC system unless we drop the aversion to
> scanning back into an index as part of an update.

I'm not terribly excited about the MySQL approach, personally, but I
really like the idea of trying to make HOT updates smarter and allow HOT
updates for indexes which don't include TIDs, as Robert and Alvaro are
discussing.

Another thought that was kicking around in my head related to that is if
we could have indexes that only provide page-level information (similar
to BRIN, but maybe a btree) and which also would allow HOT updates.
Those indexes would typically be used in a bitmap index scan where we're
going to be doing a bitmap heap scan with a recheck, of course, though I
wonder if we could come up with a way to do an in-order bitmap index
scan where we sort the tuples on the page and then perform some kind of
mergejoin recheck (or just pull out whatever the lowest-not-seen tuple
is each time we sort the tuples on the page).

All very hand-wavy, of course, and it'd make sense to make the concept
work for BRIN before we consider anything else, but it seems like there
could be a use-case for allowing indexes other than BRIN to be built in
a way that allows HOT updates to happen, thus eliminating the cost of
having to update those indexes when the tuple is changed, in many cases.
Of course, those indexes couldn't be used as UNIQUE indexes or for
primary keys, and by adjusting the parameters of a BRIN index you could
possibly get a similar index, but this might allow such an index to
still be usable for index-only scans, which a BRIN index will never be
able to provide.

Thanks!

Stephen

Re: Why we lost Uber as a user

From
Bruce Momjian
Date:
On Fri, Jul 29, 2016 at 10:44:29AM -0400, Stephen Frost wrote:
> Another thought that was kicking around in my head related to that is if
> we could have indexes that only provide page-level information (similar
> to BRIN, but maybe a btree) and which also would allow HOT updates.
> Those indexes would typically be used in a bitmap index scan where we're
> going to be doing a bitmap heap scan with a recheck, of course, though I
> wonder if we could come up with a way to do an in-order bitmap index
> scan where we sort the tuples on the page and then perform some kind of
> mergejoin recheck (or just pull out whatever the lowest-not-seen tuple
> is each time we sort the tuples on the page).

So allow HOT updates if the updated row is on the same page, even if the
indexed column was changed, by scanning the page --- got it.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Why we lost Uber as a user

From
Stephen Frost
Date:
* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Jul 29, 2016 at 10:44:29AM -0400, Stephen Frost wrote:
> > Another thought that was kicking around in my head related to that is if
> > we could have indexes that only provide page-level information (similar
> > to BRIN, but maybe a btree) and which also would allow HOT updates.
> > Those indexes would typically be used in a bitmap index scan where we're
> > going to be doing a bitmap heap scan with a recheck, of course, though I
> > wonder if we could come up with a way to do an in-order bitmap index
> > scan where we sort the tuples on the page and then perform some kind of
> > mergejoin recheck (or just pull out whatever the lowest-not-seen tuple
> > is each time we sort the tuples on the page).
>
> So allow HOT updates if the updated row is on the same page, even if the
> indexed column was changed, by scanning the page --- got it.

The idea I had was to allow creation of indexes which *only* include the
page ID.  Your rephrase seems to imply that we'd have a regular index
but then be able to figure out if a given tuple had any HOT updates
performed on it and, if so, scan the entire page.  As I understand it,
it's more complicated than that because we must involve an index when
updating a tuple in some cases (UNIQUE?) and therefore we don't perform
HOT in the case where any indexed column is being changed.

Of course, this only works if these page-level indexes don't support the
features that prevent HOT updates today.  If we can tell which existing
indexes have been built in a such a way to prevent HOT updates and which
would work with a HOT updated tuple, then perhaps we could change the
HOT code to check that when it's considering if a tuple can be updated
using HOT or not and not have only specific indexes able to support HOT
updated tuples.

This is clearly all hand-wavy, but if the BRIN indexes could work in
this way then it seems like we should be able to generalize what it is
about BRIN that allows it and provide a way for other kinds of indexes
to support tuples being HOT updated.  It appears to be clearly useful in
some use-cases.  That's really what I was trying to get at.

Thanks!

Stephen

Re: Why we lost Uber as a user

From
Hannu Krosing
Date:
On 07/27/2016 12:07 AM, Tom Lane wrote:
>
>> 4. Now, update that small table 500 times per second.
>> That's a recipe for runaway table bloat; VACUUM can't do much because
>> there's always some minutes-old transaction hanging around (and SNAPSHOT
>> TOO OLD doesn't really help, we're talking about minutes here), and
>> because of all of the indexes HOT isn't effective.
> Hm, I'm not following why this is a disaster.  OK, you have circa 100%
> turnover of the table in the lifespan of the slower transactions, but I'd
> still expect vacuuming to be able to hold the bloat to some small integer
> multiple of the minimum possible table size.  (And if the table is small,
> that's still small.)  I suppose really long transactions (pg_dump?) could
> be pretty disastrous, but there are ways around that, like doing pg_dump
> on a slave.
Is there any theoretical obstacle which would make it impossible to
teach VACUUM not to hold back the whole vacuum horizon, but just
to leave a single transaction alone in case of a long-running
REPEATABLE READ transaction?

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic Ltd




Re: Why we lost Uber as a user

From
Stephen Frost
Date:
* Hannu Krosing (hkrosing@gmail.com) wrote:
> On 07/27/2016 12:07 AM, Tom Lane wrote:
> >
> >> 4. Now, update that small table 500 times per second.
> >> That's a recipe for runaway table bloat; VACUUM can't do much because
> >> there's always some minutes-old transaction hanging around (and SNAPSHOT
> >> TOO OLD doesn't really help, we're talking about minutes here), and
> >> because of all of the indexes HOT isn't effective.
> > Hm, I'm not following why this is a disaster.  OK, you have circa 100%
> > turnover of the table in the lifespan of the slower transactions, but I'd
> > still expect vacuuming to be able to hold the bloat to some small integer
> > multiple of the minimum possible table size.  (And if the table is small,
> > that's still small.)  I suppose really long transactions (pg_dump?) could
> > be pretty disastrous, but there are ways around that, like doing pg_dump
> > on a slave.
> Is there any theoretical obstacle which would make it impossible to
> teach VACUUM not to hold back the whole vacuum horizon, but just
> to leave a single transaction alone in case of a long-running
> REPEATABLE READ transaction?

I've looked into this a couple of times and I believe it's possible to
calculate what records have to remain available for the long-running
transaction, but it's far from trivial.

I do think that's a direction which we really need to go in, however.
Having a single horizon which is dictated by the oldest running
transaction isn't a tenable solution in environments with a lot of
churn.

Thanks!

Stephen

Re: Why we lost Uber as a user

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Hannu Krosing (hkrosing@gmail.com) wrote:
>> Is there any theoretical obstacle which would make it impossible to
>> teach VACUUM not to hold back the whole vacuum horizon, but just
>> to leave a single transaction alone in case of a long-running
>> REPEATABLE READ transaction?

> I've looked into this a couple of times and I believe it's possible to
> calculate what records have to remain available for the long-running
> transaction, but it's far from trivial.

I think it'd become a lot easier if we went over to representing snapshots
as LSN positions (and, concomitantly, had an inexpensive way to translate
XIDs to their commit LSNs).  That would mean that

(1) a backend's snapshot state could be fully exposed in PGPROC, at least
up to some small number of active snapshots;

(2) it'd be fairly cheap for VACUUM to detect that a dead tuple's XMIN and
XMAX are either both before or both after each live snapshot.

Someone (Heikki, I think) has been working on this but I've not seen
any patch yet.
        regards, tom lane



Re: Why we lost Uber as a user

From
Alfred Perlstein
Date:
On 7/28/16 4:39 AM, Geoff Winkless wrote:
> On 28 Jul 2016 12:19, "Vitaly Burovoy" <vitaly.burovoy@gmail.com> wrote:
> > Please, read the article first! The bug is about wrong visibility of
> > tuples after applying WAL at slaves.
> > For example, you can see two different records selecting from a table
> > by a primary key (moreover, their PKs are the same, but other columns
> > differ).
>
> I read the article. It affected slaves as well as the master.
>
> I quote:
> "because of the way replication works, this issue has the potential to
> spread into all of the databases in a replication hierarchy"
>
> I maintain that this is a nonsense argument. Especially since (as you
> pointed out and as I missed first time around) the bug actually occurred
> at different records on different slaves, so he invalidates his own point.
>
> Geoff

Seriously?

There's a valid point here: you're sending over commands at the block
level, effectively "write to disk at this location" versus "update this
record based on PK", and obviously this has some drawbacks that are
reason for concern.  Does it validate the move on its own?  No.  Does it
add to the reasons to move away?  Yes, that much is obvious.

Please read this thread:
https://www.reddit.com/r/programming/comments/4vms8x/why_we_lost_uber_as_a_user_postgresql_mailing_list/d5zx82n

Do I love postgresql?  Yes.
Have I been bitten by things such as this?  Yes.
Should the community learn from these things and think of ways to avoid
it?  Absolutely!

-Alfred

Re: Why we lost Uber as a user

From
Alfred Perlstein
Date:

On 7/28/16 7:08 AM, Merlin Moncure wrote:
>
> *) postgres may not be the ideal choice for those who want a thin and
> simple database
This is a huge market; addressing it will bring mindshare and more jobs,
code, and braintrust to Postgres.

-Alfred



Re: Why we lost Uber as a user

From
Alfred Perlstein
Date:

On 7/26/16 9:54 AM, Joshua D. Drake wrote:
> Hello,
>
> The following article is a very good look at some of our limitations 
> and highlights some of the pains many of us have been working "around" 
> since we started using the software.
>
> https://eng.uber.com/mysql-migration/
>
> Specifically:
>
> * Inefficient architecture for writes
> * Inefficient data replication
> * Issues with table corruption
> * Poor replica MVCC support
> * Difficulty upgrading to newer releases
>
> It is a very good read and I encourage our hackers to do so with an 
> open mind.
>
> Sincerely,
>
> JD
>
It was a good read.

Having based a high-performance web tracking service as well as a
high-performance security appliance on PostgreSQL, I too have been
bitten by these issues.

I had a few questions that maybe the folks with core knowledge can answer:

1) Would it be possible to create a "star-like" schema to fix this 
problem?  For example, let's say you have a table that is similar to Uber's:
col0pk, col1, col2, col3, col4, col5

All cols are indexed.
Assuming that updates happen to only 1 column at a time.
Why not figure out some way to encourage or automate the splitting of 
this table into multiple tables that present themselves as a single table?

What I mean is that you would then wind up with the following tables:
table1: col0pk, col1
table2: col0pk, col2
table3: col0pk, col3
table4: col0pk, col4
table5: col0pk, col5

Now when you update "col5" on a row, you only have to update the indexes
on table5:col5 and table5:col0pk, as opposed to beforehand, where you
would have to update many more indexes.  In addition, I believe that
vacuum pressure would be somewhat mitigated in this case.
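For illustration, the split plus a view gluing the pieces back together
might look like this (all names invented; tables t3 and t4 analogous):

    CREATE TABLE t1 (col0pk int PRIMARY KEY, col1 int);
    CREATE TABLE t2 (col0pk int PRIMARY KEY, col2 int);
    CREATE TABLE t5 (col0pk int PRIMARY KEY, col5 int);
    CREATE INDEX ON t1 (col1);
    CREATE INDEX ON t2 (col2);
    CREATE INDEX ON t5 (col5);

    -- present the pieces as a single "table"
    CREATE VIEW t AS
        SELECT col0pk, col1, col2, col5
        FROM t1
        JOIN t2 USING (col0pk)
        JOIN t5 USING (col0pk);

    -- updating col5 now touches only t5's heap and t5's two indexes
    UPDATE t5 SET col5 = 123 WHERE col0pk = 1;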

2) Why not have a look at how InnoDB does its storage?  Would it be
possible to do the same?

3) For the small-ish table that Uber mentioned, is there a way to "have
it in memory" while providing some level of sync to disk so that it
stays consistent?

thanks!
-Alfred





Re: Why we lost Uber as a user

From
Geoff Winkless
Date:
On 2 August 2016 at 08:11, Alfred Perlstein <alfred@freebsd.org> wrote:
> On 7/28/16 4:39 AM, Geoff Winkless wrote:
> > I maintain that this is a nonsense argument. Especially since (as you
> > pointed out and as I missed first time around) the bug actually occurred
> > at different records on different slaves, so he invalidates his own point.

> Seriously?

No, I make a habit of spouting off random arguments to a list full of
people whose opinions I massively respect purely for kicks. What do
you think?

> There's a valid point here: you're sending over commands at the block
> level, effectively "write to disk at this location" versus "update this
> record based on PK", and obviously this has some drawbacks that are
> reason for concern.

Writing values directly into file offsets is only problematic if
something else has failed that has caused the file to be an inexact
copy. If a different bug occurred that caused the primary key to be
corrupted on the slave (or indeed the master), PK-based updates would
exhibit similar propagation errors.

To reiterate my point, uber's described problem came about because of
a bug. Every software has bugs at some point in its life, to pretend
otherwise is simply naive. I'm not trying to excuse the bug, or to
belittle the impact that such a bug has on data integrity or on uber
or indeed on the reputation of PostgreSQL. While I'm prepared to
accept (because I have a job that requires I spend time on things
other than digging through obscure reddits and mailing lists to
understand more fully the exact cause) that in _this particular
instance_ the bug was propagated because of the replication mechanism
(although I'm still dubious about that, as per my comment above), that
does _not_ preclude other bugs propagating in a statement-based
replication. That's what I said is a nonsense argument, and no-one has
yet explained in what way that's incorrect.

Geoff



Re: Why we lost Uber as a user

From
Amit Kapila
Date:
On Sat, Jul 30, 2016 at 12:06 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
>> On Fri, Jul 29, 2016 at 10:44:29AM -0400, Stephen Frost wrote:
>> > Another thought that was kicking around in my head related to that is if
>> > we could have indexes that only provide page-level information (similar
>> > to BRIN, but maybe a btree) and which also would allow HOT updates.
>> > Those indexes would typically be used in a bitmap index scan where we're
>> > going to be doing a bitmap heap scan with a recheck, of course, though I
>> > wonder if we could come up with a way to do an in-order bitmap index
>> > scan where we sort the tuples on the page and then perform some kind of
>> > mergejoin recheck (or just pull out whatever the lowest-not-seen tuple
>> > is each time we sort the tuples on the page).
>>
>> So allow HOT updates if the updated row is on the same page, even if the
>> indexed column was changed, by scanning the page --- got it.
>
> The idea I had was to allow creation of indexes which *only* include the
> page ID.  Your rephrase seems to imply that we'd have a regular index
> but then be able to figure out if a given tuple had any HOT updates
> performed on it and, if so, scan the entire page.  As I understand it,
> it's more complicated than that because we must involve an index when
> updating a tuple in some cases (UNIQUE?) and therefore we don't perform
> HOT in the case where any indexed column is being changed.
>

Why do we need to add a record to all indexes if only the key
corresponding to one of the indexes is updated?  Basically, if the tuple
can fit on the same page, why can't we consider it as HOT (or HPT - heap
partial tuple or something like that), unless it updates all the keys
for all the indexes.  Now, we can't consider such tuple versions for
pruning as we do for HOT.  The downside of this could be that we might
need to retain some of the line pointers for more time (as we won't be
able to reuse the line pointer till it is used in any one of the
indexes and those could be reused once we make next non-HOT update).
However, this should allow us not to update the indexes for which the
corresponding column in tuple is not updated.  I think it is a basic
premise that if any index column is updated then the update will be
considered as non-HOT, so there is a good chance that I might be
missing something here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Why we lost Uber as a user

From
Robert Haas
Date:
On Tue, Aug 2, 2016 at 5:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Why do we need to add a record to all indexes if only the key
> corresponding to one of the indexes is updated?  Basically, if the tuple
> can fit on the same page, why can't we consider it as HOT (or HPT - heap
> partial tuple or something like that), unless it updates all the keys
> for all the indexes.  Now, we can't consider such tuple versions for
> pruning as we do for HOT.  The downside of this could be that we might
> need to retain some of the line pointers for more time (as we won't be
> able to reuse the line pointer till it is used in any one of the
> indexes and those could be reused once we make next non-HOT update).
> However, this should allow us not to update the indexes for which the
> corresponding column in tuple is not updated.  I think it is a basic
> premise that if any index column is updated then the update will be
> considered as non-HOT, so there is a good chance that I might be
> missing something here.

Well, I think that the biggest advantage of a HOT update is the fact
that it enables HOT pruning.  In other words, we're not primarily
trying to minimize index traffic; we're trying to make cleanup of the
heap cheaper.  So this could certainly be done, but I'm not sure it
would buy us enough to be worth the engineering effort involved.

Personally, I think that incremental surgery on our current heap
format to try to fix this is not going to get very far.  If you look
at the history of this, 8.3 was a huge release for timely cleanup of
dead tuple.  There was also significant progress in 8.4 as a result of
5da9da71c44f27ba48fdad08ef263bf70e43e689.   As far as I can recall, we
then made no progress at all in 9.0 - 9.4.  We made a very small
improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but
that's pretty niche.  In 9.6, we have "snapshot too old", which I'd
argue is potentially a large improvement, but it was big and invasive
and will no doubt pose code maintenance hazards in the years to come;
also, many people won't be able to use it or won't realize that they
should use it.  I think it is likely that further incremental
improvements here will be quite hard to find, and the amount of effort
will be large relative to the amount of benefit.  I think we need a
new storage format where the bloat is cleanly separated from the data
rather than intermingled with it; every other major RDBMS works that
way.  Perhaps this is a case of "the grass is greener on the other
side of the fence", but I don't think so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why we lost Uber as a user

From
Simon Riggs
Date:
On 2 August 2016 at 15:27, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 2, 2016 at 5:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Why do we need to add a record to all indexes if only the key
>> corresponding to one of the indexes is updated?  Basically, if the tuple
>> can fit on the same page, why can't we consider it as HOT (or HPT - heap
>> partial tuple or something like that), unless it updates all the keys
>> for all the indexes.  Now, we can't consider such tuple versions for
>> pruning as we do for HOT.  The downside of this could be that we might
>> need to retain some of the line pointers for more time (as we won't be
>> able to reuse the line pointer till it is used in any one of the
>> indexes and those could be reused once we make next non-HOT update).
>> However, this should allow us not to update the indexes for which the
>> corresponding column in tuple is not updated.  I think it is a basic
>> premise that if any index column is updated then the update will be
>> considered as non-HOT, so there is a good chance that I might be
>> missing something here.
>
> Well, I think that the biggest advantage of a HOT update is the fact
> that it enables HOT pruning.  In other words, we're not primarily
> trying to minimize index traffic; we're trying to make cleanup of the
> heap cheaper.  So this could certainly be done, but I'm not sure it
> would buy us enough to be worth the engineering effort involved.

(Hi, just back from leave and catching up on emails.)

The above suggested design is something I've been working on for the last
few days. In my design I referred to "intermediate root" tuples. I've
got a detailed design for it and it works, yay!... but Pavan has
managed to shoot it down with some accurate observations about it
leading to an annoying accumulation of root pointers and complex logic
to remove them. So I'm not pursuing it further at this stage.

I'm writing up my conclusions around what we should do now, so should
post later today.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why we lost Uber as a user

From
Alfred Perlstein
Date:

> On Aug 2, 2016, at 2:33 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
>
>> On 2 August 2016 at 08:11, Alfred Perlstein <alfred@freebsd.org> wrote:
>>> On 7/2/16 4:39 AM, Geoff Winkless wrote:
>>> I maintain that this is a nonsense argument. Especially since (as you pointed out and as I missed first time
around) the bug actually occurred at different records on different slaves, so he invalidates his own point.
>
>> Seriously?
>
> No, I make a habit of spouting off random arguments to a list full of
> people whose opinions I massively respect purely for kicks. What do
> you think?
>
>> There's a valid point here, you're sending over commands at the block level, effectively "write to disk at this
location"versus "update this record based on PK", obviously this has some drawbacks that are reason for concern. 
>
> Writing values directly into file offsets is only problematic if
> something else has failed that has caused the file to be an inexact
> copy. If a different bug occurred that caused the primary key to be
> corrupted on the slave (or indeed the master), PK-based updates would
> exhibit similar propagation errors.
>
> To reiterate my point, uber's described problem came about because of
> a bug. Every software has bugs at some point in its life, to pretend
> otherwise is simply naive. I'm not trying to excuse the bug, or to
> belittle the impact that such a bug has on data integrity or on uber
> or indeed on the reputation of PostgreSQL. While I'm prepared to
> accept (because I have a job that requires I spend time on things
> other than digging through obscure reddits and mailing lists to
> understand more fully the exact cause) that in _this particular
> instance_ the bug was propagated because of the replication mechanism
> (although I'm still dubious about that, as per my comment above), that
> does _not_ preclude other bugs propagating in a statement-based
> replication. That's what I said is a nonsense argument, and no-one has
> yet explained in what way that's incorrect.
>
> Geoff


Geoff,

You are quite technical; my feeling is that you will understand it, however it will need to be a self-learned lesson.

-Alfred





Re: Why we lost Uber as a user

From
Robert Haas
Date:
On Tue, Aug 2, 2016 at 3:07 PM, Alfred Perlstein <alfred@freebsd.org> wrote:
> You are quite technical; my feeling is that you will understand it, however it will need to be a self-learned
lesson.

I don't know what this is supposed to mean, but I think that Geoff's
point is somewhat valid.  No matter how you replicate data, there is
always the possibility that you will replicate any corruption along
with the data - or that your copy will be unfaithful to the original.
The possible advantage of logical replication rather than physical
replication is that any errors you replicate will be logical errors
rather than physical errors - so if the heap gets out of step with the
indexes on the master, the same problem will not necessarily occur on
the slave.  On the flip side, despite what Uber found in their
environment, physical replication tends to be high-performance because
the replay is dead simple.  Andres and others have done a good job
making our logical decoding facility fast, but I believe it's still
slower than plain old physical replication and probably always will
be, and the trigger-based logical replication solutions are slower
still.  Consequently, I believe that both physical and logical
replication have advantages, and that's why we should support both of
them.  Then, each individual user can make the trade-offs they prefer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why we lost Uber as a user

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Aug 2, 2016 at 3:07 PM, Alfred Perlstein <alfred@freebsd.org> wrote:
> > You are quite technical; my feeling is that you will understand it, however it will need to be a self-learned
lesson.
>
> I don't know what this is supposed to mean, but I think that Geoff's
> point is somewhat valid.  No matter how you replicate data, there is
> always the possibility that you will replicate any corruption along
> with the data - or that your copy will be unfaithful to the original.

I believe what Geoff was specifically getting at is probably best
demonstrated with an example.

Consider a bug in the btree index code which will accept a value but not
store it correctly.

INSERT INTO mytable (indexed_column) VALUES (-1000000000);

/* oops, bug, this value gets stored in the wrong place in the btree */

We happily accept the record and insert it into the btree index, but
that insert is incorrect and results in the btree being corrupted
because some bug doesn't handle such large values correctly.

In such a case, either approach to replication (replicating the query
statement, or replicating the changes to the btree page exactly) would
result in corruption on the replica.

The above represents a bug in *just* the btree side of things (the
physical replication did its job correctly, even though the result is a
corrupted index on the replica).

With physical replication, there is the concern that a bug in *just* the
physical (WAL) side of things could cause corruption.  That is, we
correctly accept and store the value on the primary, but the records
generated to send that data to the replica are incorrect and result in
an invalid state on the replica.

Of course, a bug in the physical side of things which caused corruption
would mean that *crash recovery* would also cause corruption.  As I
understand it, that same concern exists for MySQL, so, moving to logical
replication doesn't actually mean you don't need to worry about bugs in
the crash recovery side of things, assuming you depend on the database
to come back up in a consistent manner after a crash.

Thanks!

Stephen

Re: Why we lost Uber as a user

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> With physical replication, there is the concern that a bug in *just* the
> physical (WAL) side of things could cause corruption.

Right.  But with logical replication, there's the same risk that the
master's state could be fine but a replication bug creates corruption on
the slave.

Assuming that the logical replication works by issuing valid SQL commands
to the slave, one could hope that this sort of "corruption" only extends
to having valid data on the slave that fails to match the master.
But that's still not a good state to be in.  And to the extent that
performance concerns lead the implementation to bypass some levels of the
SQL engine, you can easily lose that guarantee too.

In short, I think Uber's position that logical replication is somehow more
reliable than physical is just wishful thinking.  If anything, my money
would be on the other way around: there's a lot less mechanism that can go
wrong in physical replication.  Which is not to say there aren't good
reasons to use logical replication; I just do not believe that one.
        regards, tom lane



Re: Why we lost Uber as a user

From
Alfred Perlstein
Date:

On 8/2/16 2:14 PM, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> With physical replication, there is the concern that a bug in *just* the
>> physical (WAL) side of things could cause corruption.
> Right.  But with logical replication, there's the same risk that the
> master's state could be fine but a replication bug creates corruption on
> the slave.
>
> Assuming that the logical replication works by issuing valid SQL commands
> to the slave, one could hope that this sort of "corruption" only extends
> to having valid data on the slave that fails to match the master.
> But that's still not a good state to be in.  And to the extent that
> performance concerns lead the implementation to bypass some levels of the
> SQL engine, you can easily lose that guarantee too.
>
> In short, I think Uber's position that logical replication is somehow more
> reliable than physical is just wishful thinking.  If anything, my money
> would be on the other way around: there's a lot less mechanism that can go
> wrong in physical replication.  Which is not to say there aren't good
> reasons to use logical replication; I just do not believe that one.
>
>             regards, tom lane
>
>
The reason it can be less catastrophic is that with logical replication 
you may futz up your data, but you are safe from corrupting your entire 
db.  Meaning, if an update is missed or doubled it may be addressed by 
a fixup SQL statement, whereas if the replication causes a write to the 
entirely wrong place in the db file then you need to "fsck" your db and 
hope that nothing super critical was blown away.
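
(For example, a doubled logical replay might leave duplicate rows that a
fixup statement can clean out again -- a rough sketch against a made-up
table:

    DELETE FROM payments
     WHERE ctid IN (SELECT ctid
                      FROM (SELECT ctid, row_number() OVER
                                   (PARTITION BY txn_id) AS rn
                              FROM payments) dup
                     WHERE dup.rn > 1);  -- keep one row per txn_id

whereas a stray write into the middle of a heap file has no such SQL-level
remedy.)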

The impact across a cluster is potentially magnified by physical 
replication.

So for instance, let's say there is a bug in the master's write to 
disk.  Logical replication acts as a barrier against that bad write 
going to the slaves.  With physical replication the bad writes do go to 
the slaves, so any corruption experienced on the master will quickly 
reach the slaves and they too will be corrupted.

With logical replication a bug may be stopped at the replication layer.  
At that point you can resync the slave from the master.

Now in the case of physical replication all your base are belong to zuul 
and you are in a very bad state.

That said, with logical replication, who's to say that a slave 
replaying the statement won't hit the same bug and also corrupt itself.

We may be saying the same thing, but still there is something to be said 
for logical replication... also, didn't they show that logical 
replication was faster for some use cases at Uber?

-Alfred








Re: Why we lost Uber as a user

From
Bruce Momjian
Date:
On Tue, Aug  2, 2016 at 07:30:22PM -0700, Alfred Perlstein wrote:
> So for instance, let's say there is a bug in the master's write to disk.
> The logical replication acts as a barrier from that bad write going to the
> slaves.   With bad writes going to slaves then any corruption experienced on
> the master will quickly reach the slaves and they too will be corrupted.
> 
> With logical replication a bug may be stopped at the replication layer.  At
> that point you can resync the slave from the master.
> 
> Now in the case of physical replication all your base are belong to zuul and
> you are in a very bad state.
> 
> That said with logical replication, who's to say that if the statement is
> replicated to a slave that the slave won't experience the same bug and also
> corrupt itself.
> 
> We may be saying the same thing, but still there is something to be said for
> logical replication... also, didn't they show that logical replication was
> faster for some use cases at Uber?

I saw from the Uber article that they weren't going to per-row logical
replication but _statement_ replication, which is very hard to do
because typical SQL doesn't record what concurrent transactions
committed before a new statement's transaction snapshot is taken, and
doesn't record lock order for row updates blocked by concurrent activity
--- both of which affect the final result from the query.

So, for statement replication, it is not a question of whether the code
has bugs, but that the replay is not 100% possible in all cases, unless
you switch to some statement-row-lock hybrid ability.
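
To make that concrete with a trivial sketch (hypothetical table): replay
the statement text

    UPDATE jobs
       SET claimed_at = now(), claimed_by = pg_backend_pid()
     WHERE id = (SELECT id FROM jobs
                  WHERE claimed_at IS NULL LIMIT 1);

on a replica and it can pick a different row (LIMIT without ORDER BY), a
different timestamp, and a different backend pid than it did on the
master, so the result diverges even if the code is entirely bug-free.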

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Why we lost Uber as a user

From
Mark Kirkwood
Date:
On 03/08/16 02:27, Robert Haas wrote:
>
> Personally, I think that incremental surgery on our current heap
> format to try to fix this is not going to get very far.  If you look
> at the history of this, 8.3 was a huge release for timely cleanup of
> dead tuples.  There was also significant progress in 8.4 as a result of
> 5da9da71c44f27ba48fdad08ef263bf70e43e689.   As far as I can recall, we
> then made no progress at all in 9.0 - 9.4.  We made a very small
> improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but
> that's pretty niche.  In 9.6, we have "snapshot too old", which I'd
> argue is potentially a large improvement, but it was big and invasive
> and will no doubt pose code maintenance hazards in the years to come;
> also, many people won't be able to use it or won't realize that they
> should use it.  I think it is likely that further incremental
> improvements here will be quite hard to find, and the amount of effort
> will be large relative to the amount of benefit.  I think we need a
> new storage format where the bloat is cleanly separated from the data
> rather than intermingled with it; every other major RDBMS works that
> way.  Perhaps this is a case of "the grass is greener on the other
> side of the fence", but I don't think so.
>
Yeah, I think this is a good summary of the state of play.

The only other new db development to use a non-overwriting design like 
ours that I know of was Jim Starkey's Falcon engine for (ironically) 
MySQL 6.0. Not sure if anyone is still progressing that at all now.

I do wonder if Uber could have successfully tamed dead tuple bloat with 
aggressive per-table autovacuum settings (and if in fact they tried), 
but as I think Robert said earlier, it is pretty easy to come up with a 
high-update (or insert + delete) workload that makes for a pretty ugly 
bloat component even with really aggressive autovacuuming.
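
(By aggressive per-table settings I mean something along these lines -- 
the values are illustrative only:

    ALTER TABLE hot_table SET (
        autovacuum_vacuum_scale_factor = 0.0,
        autovacuum_vacuum_threshold = 1000,
        autovacuum_vacuum_cost_delay = 0
    );

i.e. vacuum the table after a fixed number of dead rows rather than a 
fraction of the table, and don't throttle the work.)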

Cheers

Mark




Re: Why we lost Uber as a user

From
Simon Riggs
Date:
On 29 July 2016 at 15:44, Stephen Frost <sfrost@snowman.net> wrote:

> All very hand-wavy, of course, and it'd make sense to make the concept
> work for BRIN before we consider anything else, but it seems like there
> could be a use-case for allowing indexes other than BRIN to be built in
> a way that allows HOT updates to happen, thus eliminating the cost of
> having to update those indexes when the tuple is changed, in many cases.
> Of course, those indexes couldn't be used as UNIQUE indexes or used for
> primary keys, and by adjusting the parameters of a BRIN index you could
> possibly get a similar index, but this might allow such an index to
> still be usable for index-only scans, which a BRIN index will never be
> able to provide.

This idea is vaguely similar to the concepts I've been working on,
based on earlier work in 2007.

I'm starting a new post for a full discussion.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why we lost Uber as a user

From
Greg Stark
Date:
On Wed, Aug 3, 2016 at 3:30 AM, Alfred Perlstein <alfred@freebsd.org> wrote:
> We may be saying the same thing, but still there is something to be said for
> logical replication... also, didn't they show that logical replication was
> faster for some use cases at Uber?

There is certainly something to be said for logical replication just
as there is something to be said for having regular pg_dumps which are
logical exports of your database. But neither is a substitute for
having real backups or a real standby database. They serve different
purposes and solve different problems.

But when you have a hardware failure or physical disaster the last
thing you want to be doing is failing over to a different database
that may or may not have the same data or same behaviour as your
former primary. You want to switch over to a standby that is as near
as possible to byte-for-byte identical and will behave exactly the same.
If there was a bug in your primary, the last time you want to find out
about it and have to deal with fixing it is when you have a disaster
in your primary and need to be back up ASAP.

Honestly the take-away I see in the Uber story is that they apparently
had nobody on staff that was on -hackers or apparently even -general
and tried to go it alone rather than involve experts from outside
their company. As a result they misdiagnosed their problems based on
prejudices, seeing what they expected to see rather than what the real
problem was.

-- 
greg



Re: Why we lost Uber as a user

From
Bruce Momjian
Date:
On Tue, Aug  2, 2016 at 10:33:15PM -0400, Bruce Momjian wrote:
> I saw from the Uber article that they weren't going to per-row logical
> replication but _statement_ replication, which is very hard to do
> because typical SQL doesn't record what concurrent transactions
> committed before a new statement's transaction snapshot is taken, and
> doesn't record lock order for row updates blocked by concurrent activity
> --- both of which affect the final result from the query.
> 
> So, for statement replication, it is not a question of whether the code
> has bugs, but that the replay is not 100% possible in all cases, unless
> you switch to some statement-row-lock hybrid ability.

Oh, and one more problem with statement-level replication is that the
overhead of statement replay is high, as high as it was on the master. 
That leaves minimal server resources left to handle read-only workloads
on the slave.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Why we lost Uber as a user

From
Craig Ringer
Date:
On 3 August 2016 at 05:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 

> In short, I think Uber's position that logical replication is somehow more
> reliable than physical is just wishful thinking.  If anything, my money
> would be on the other way around: there's a lot less mechanism that can go
> wrong in physical replication.

Particularly since they aren't using row-based logical replication, but - it seems - statement-based replication. We all know the problems there.
 

 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Why we lost Uber as a user

From
Alfred Perlstein
Date:

> On Aug 3, 2016, at 3:29 AM, Greg Stark <stark@mit.edu> wrote:
>
>>
>
> Honestly the take-away I see in the Uber story is that they apparently
> had nobody on staff that was on -hackers or apparently even -general
> and tried to go it alone rather than involve experts from outside
> their company. As a result they misdiagnosed their problems based on
> prejudices, seeing what they expected to see rather than what the real
> problem was.
>

+1 very true.

At the same time there are some lessons to be learned. At the very least putting in big bold letters where to come for
help is one.






Re: Why we lost Uber as a user

From
Kevin Grittner
Date:
On Wed, Aug 3, 2016 at 8:58 AM, Alfred Perlstein <alfred@freebsd.org> wrote:
> On Aug 3, 2016, at 3:29 AM, Greg Stark <stark@mit.edu> wrote:
>
>> Honestly the take-away I see in the Uber story is that they apparently
>> had nobody on staff that was on -hackers or apparently even -general
>> and tried to go it alone rather than involve experts from outside
>> their company. As a result they misdiagnosed their problems based on
>> prejudices, seeing what they expected to see rather than what the real
>> problem was.
>
> +1 very true.
>
> At the same time there are some lessons to be learned. At the
> very least putting in big bold letters where to come for help is
> one.

+1

My initial experience with PostgreSQL would have been entirely
different had I not found the community lists and benefited from
the assistance and collective wisdom found on them.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why we lost Uber as a user

From
Geoff Winkless
Date:
On 3 August 2016 at 15:04, Kevin Grittner <kgrittn@gmail.com> wrote:
> My initial experience with PostgreSQL would have been entirely
> different had I not found the community lists and benefited from
> the assistance and collective wisdom found on them.

The top non-sponsored link on google for "postgres support" takes you
straight to a page with a link to the mailing lists. I'm not sure that
not being able to find them was a problem.

I can well imagine that Uber wouldn't have wanted to publicise their
problems (and so wouldn't have used a mailing list anyway); obviously
I've no way of knowing if they contacted any of the support companies
on the professional services page - I assume that professional
courtesy (and/or NDAs!) would preclude anyone from posting such here
anyway.

The problem with the professional services page is that the list of
companies is very dry, but it might be difficult to improve: as a
community it might be considered unreasonable to promote one over the
other; however if I had to go searching for professional support (and
hadn't seen the level of interaction that some of those companies'
employees provide on the mailing lists) I would have no clear idea
where to start.

Perhaps listing those companies that provide employment for some of
the core developers at the top (and explaining so) might be
acceptable? (or maybe not just core? you get the idea though). Maybe a
separate section for support companies versus hosts?

Geoff



Re: Why we lost Uber as a user

From
Robert Haas
Date:
On Tue, Aug 2, 2016 at 5:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> With physical replication, there is the concern that a bug in *just* the
>> physical (WAL) side of things could cause corruption.
>
> Right.  But with logical replication, there's the same risk that the
> master's state could be fine but a replication bug creates corruption on
> the slave.
>
> Assuming that the logical replication works by issuing valid SQL commands
> to the slave, one could hope that this sort of "corruption" only extends
> to having valid data on the slave that fails to match the master.
> But that's still not a good state to be in.  And to the extent that
> performance concerns lead the implementation to bypass some levels of the
> SQL engine, you can easily lose that guarantee too.
>
> In short, I think Uber's position that logical replication is somehow more
> reliable than physical is just wishful thinking.  If anything, my money
> would be on the other way around: there's a lot less mechanism that can go
> wrong in physical replication.  Which is not to say there aren't good
> reasons to use logical replication; I just do not believe that one.

I don't think they are saying that logical replication is more
reliable than physical replication, nor do I believe that to be true.
I think they are saying that if logical corruption happens, you can
fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
affected rows, whereas if physical corruption happens, there's no
equally clear path to recovery.  If an index is damaged, you can
recreate it; if a heap page is damaged such that you can no longer
scan the table, you're going to need expert assistance.

And I think there's some point to that.  I agree with the general
sentiment that they could have gotten further and been more successful
with PostgreSQL if they had some expert advice, but I think it's
indisputable that recovering a physically corrupted database is
generally a lot more painful than one where you only have to fix up
some damaged data.  Whether we really have data-corrupting WAL-replay
bugs sufficiently frequently to make this an ongoing issue rather than
a one-time event is also debatable, but nonetheless I don't think
their point is completely invalid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why we lost Uber as a user

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I don't think they are saying that logical replication is more
> reliable than physical replication, nor do I believe that to be true.
> I think they are saying that if logical corruption happens, you can
> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
> affected rows, whereas if physical corruption happens, there's no
> equally clear path to recovery.

Well, that's not an entirely unreasonable point, but I dispute the
implication that it makes recovery from corruption an easy thing to do.
How are you going to know what SQL statements to issue?  If the master
database is changing 24x7, how are you going to keep up with that?

I think the realistic answer if you suffer replication-induced corruption
is usually going to be "re-clone that slave", and logical rep doesn't
really offer much gain in that.
        regards, tom lane



Re: Why we lost Uber as a user

From
"Joshua D. Drake"
Date:
On 08/03/2016 11:23 AM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I don't think they are saying that logical replication is more
>> reliable than physical replication, nor do I believe that to be true.
>> I think they are saying that if logical corruption happens, you can
>> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
>> affected rows, whereas if physical corruption happens, there's no
>> equally clear path to recovery.
>
> Well, that's not an entirely unreasonable point, but I dispute the
> implication that it makes recovery from corruption an easy thing to do.
> How are you going to know what SQL statements to issue?  If the master
> database is changing 24x7, how are you going to keep up with that?
>
> I think the realistic answer if you suffer replication-induced corruption
> is usually going to be "re-clone that slave", and logical rep doesn't
> really offer much gain in that.

Yes, it actually does. The ability to unsubscribe a set of tables, 
truncate them and then resubscribe them is vastly superior to having to 
take a base backup.

JD

>
>             regards, tom lane
>
>


-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.



Re: Why we lost Uber as a user

From
Robert Haas
Date:
On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I don't think they are saying that logical replication is more
>> reliable than physical replication, nor do I believe that to be true.
>> I think they are saying that if logical corruption happens, you can
>> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
>> affected rows, whereas if physical corruption happens, there's no
>> equally clear path to recovery.
>
> Well, that's not an entirely unreasonable point, but I dispute the
> implication that it makes recovery from corruption an easy thing to do.
> How are you going to know what SQL statements to issue?  If the master
> database is changing 24x7, how are you going to keep up with that?

I think in many cases people fix their data using business logic.  For
example, suppose your database goes down and you have to run
pg_resetxlog to get it back up.  You dump-and-restore, as one does,
and find that you can't rebuild one of your unique indexes because
there are now two records with that same PK.  Well, what you do is you
look at them and judge which one has the correct data, often the one
that looks more complete or the one with the newer timestamp.  Or,
maybe you need to merge them somehow.  In my experience helping users
through problems of this type, once you explain the problem to the
user and tell them they have to square it on their end, the support
call ends.  The user may not always be entirely thrilled about having
to, say, validate a problematic record against external sources of
truth, but they usually know how to do it.  Database bugs aren't the
only way that databases become inaccurate.  If the database that they
use to keep track of land ownership in the jurisdiction where I live
says that two different people own the same piece of property,
somewhere there is a paper deed in a filing cabinet.  Fishing that out
to understand what happened may not be fun, but a DBA can explain that
problem to other people in the organization and those people can get
it fixed.  It's a problem, but it's fixable.
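
(The duplicate-key hunt in that scenario is mundane SQL -- a sketch, with
a hypothetical table and key column:

    SELECT pk, count(*)
      FROM payroll
     GROUP BY pk
    HAVING count(*) > 1;

and then someone who knows the data picks the winner for each pk, or
merges the rows.)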

On the other hand, if a heap tuple contains invalid infomask bits that
cause an error every time you read the page (this actually happened to
an EnterpriseDB customer!), the DBA can't tell other people how to fix
it and can't fix it personally either.  Instead, the DBA calls me.
While I try to figure out what happened and solve the problem, every
sequential scan on that table fails, so the customer is basically
down.  In contrast, in the logical corruption scenario, one record
might be wrong, but basically everything is still working.  So it's a
difference between a problem that the DBA can work with coworkers to
fix while the system is up, and a problem that the DBA can't fix and
the system is meanwhile down.  That's a big difference.

> I think the realistic answer if you suffer replication-induced corruption
> is usually going to be "re-clone that slave", and logical rep doesn't
> really offer much gain in that.

If you're using multi-master replication, the notion of what's a slave
gets a bit fuzzy, but, apart from that, yes, this is often the
solution.  However, even here, logical replication can be better.
Given the right tools, I can fix up the slave incrementally, comparing
it to the master row by row and updating anything that's wrong.  If I
have to rebuild a physical master, I'm offline.  The difference
doesn't matter if the slave is so badly corrupted that it's unusable,
but it's very common for corruption to involve only a handful of
records, and many users not unreasonably prefer a database with a
couple of corrupted records to one which is totally down.  "Hey, the
payroll record for that Tom Lane guy is messed up, don't cut his
paycheck until we get that straightened out."  "OK, no problem."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why we lost Uber as a user

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> On 08/03/2016 11:23 AM, Tom Lane wrote:
>> I think the realistic answer if you suffer replication-induced corruption
>> is usually going to be "re-clone that slave", and logical rep doesn't
>> really offer much gain in that.

> Yes, it actually does. The ability to unsubscribe a set of tables, 
> truncate them and then resubscribe them is vastly superior to having to 
> take a base backup.

True, *if* you can circumscribe the corruption to a relatively small
part of your database, logical rep might provide more support for a
partial re-clone.
        regards, tom lane



Re: Why we lost Uber as a user

From
Kevin Grittner
Date:
On Wed, Aug 3, 2016 at 2:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> On 08/03/2016 11:23 AM, Tom Lane wrote:
>>> I think the realistic answer if you suffer replication-induced corruption
>>> is usually going to be "re-clone that slave", and logical rep doesn't
>>> really offer much gain in that.
>
>> Yes, it actually does. The ability to unsubscribe a set of tables,
>> truncate them and then resubscribe them is vastly superior to having to
>> take a base backup.
>
> True, *if* you can circumscribe the corruption to a relatively small
> part of your database, logical rep might provide more support for a
> partial re-clone.

When I worked with Wisconsin Courts to migrate their databases to
PostgreSQL, we had a DBMS-agnostic logical replication system, and
we had a compare program that could be run off-hours as well as
having that be a background activity for the replication software
to work on during idle time.  Either way, a range of rows based on
primary key was read on each side and hashed, the hashes compared,
and if they didn't match there was a column-by-column compare for
each row in the range, with differences listed.  This is how we
discovered issues like the non-standard handling of backslash
mangling our data.
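
(The range-and-hash step can be plain SQL run on both sides -- a rough
sketch with a hypothetical table:

    SELECT md5(string_agg(c::text, '' ORDER BY case_no))
      FROM court_case c
     WHERE case_no BETWEEN '2016CV000001' AND '2016CV001000';

and only the ranges whose hashes differ get the expensive row-by-row,
column-by-column treatment.)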

Personally, I can't imagine running logical replication of
supposedly matching sets of data without something equivalent.

Certainly, the courts had source documents to use for resolving any
question of the correct value on a mismatch, and I would imagine
that many environments would.  If you have a meaningful primary key
(like a court case number, by which the file folder is physically
located), seeing the different values for a specific column in a
specific row makes fixes pretty straightforward.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why we lost Uber as a user

From
Torsten Zuehlsdorff
Date:

On 03.08.2016 21:05, Robert Haas wrote:
> On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> I don't think they are saying that logical replication is more
>>> reliable than physical replication, nor do I believe that to be true.
>>> I think they are saying that if logical corruption happens, you can
>>> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
>>> affected rows, whereas if physical corruption happens, there's no
>>> equally clear path to recovery.
>>
>> Well, that's not an entirely unreasonable point, but I dispute the
>> implication that it makes recovery from corruption an easy thing to do.
>> How are you going to know what SQL statements to issue?  If the master
>> database is changing 24x7, how are you going to keep up with that?
>
> I think in many cases people fix their data using business logic.  For
> example, suppose your database goes down and you have to run
> pg_resetxlog to get it back up.  You dump-and-restore, as one does,
> and find that you can't rebuild one of your unique indexes because
> there are now two records with that same PK.  Well, what you do is you
> look at them and judge which one has the correct data, often the one
> that looks more complete or the one with the newer timestamp.  Or,
> maybe you need to merge them somehow.  In my experience helping users
> through problems of this type, once you explain the problem to the
> user and tell them they have to square it on their end, the support
> call ends.  The user may not always be entirely thrilled about having
> to, say, validate a problematic record against external sources of
> truth, but they usually know how to do it.  Database bugs aren't the
> only way that databases become inaccurate.  If the database that they
> use to keep track of land ownership in the jurisdiction where I live
> says that two different people own the same piece of property,
> somewhere there is a paper deed in a filing cabinet.  Fishing that out
> to understand what happened may not be fun, but a DBA can explain that
> problem to other people in the organization and those people can get
> it fixed.  It's a problem, but it's fixable.
>
> On the other hand, if a heap tuple contains invalid infomask bits that
> cause an error every time you read the page (this actually happened to
> an EnterpriseDB customer!), the DBA can't tell other people how to fix
> it and can't fix it personally either.  Instead, the DBA calls me.

After reading this statement the ZFS filesystem pops into my mind. It 
has protection built in against various problems (data degradation, 
current spikes, phantom writes, etc.).

For me this raises two questions:

1) Would the usage of ZFS prevent such errors?

My feeling would say yes, but I have no idea how an invalid 
infomask bit could occur.

2) Would it be possible to add such prevention to PostgreSQL?

I know this could add a massive overhead, but if it's optional this could 
be a fine thing?

Greetings,
Torsten



Re: Why we lost Uber as a user

From
Alfred Perlstein
Date:

On 8/4/16 2:00 AM, Torsten Zuehlsdorff wrote:
>
>
> On 03.08.2016 21:05, Robert Haas wrote:
>> On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> I don't think they are saying that logical replication is more
>>>> reliable than physical replication, nor do I believe that to be true.
>>>> I think they are saying that if logical corruption happens, you can
>>>> fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
>>>> affected rows, whereas if physical corruption happens, there's no
>>>> equally clear path to recovery.
>>>
>>> Well, that's not an entirely unreasonable point, but I dispute the
>>> implication that it makes recovery from corruption an easy thing to do.
>>> How are you going to know what SQL statements to issue?  If the master
>>> database is changing 24x7, how are you going to keep up with that?
>>
>> I think in many cases people fix their data using business logic.  For
>> example, suppose your database goes down and you have to run
>> pg_resetxlog to get it back up.  You dump-and-restore, as one does,
>> and find that you can't rebuild one of your unique indexes because
>> there are now two records with that same PK.  Well, what you do is you
>> look at them and judge which one has the correct data, often the one
>> that looks more complete or the one with the newer timestamp. Or,
>> maybe you need to merge them somehow.  In my experience helping users
>> through problems of this type, once you explain the problem to the
>> user and tell them they have to square it on their end, the support
>> call ends.  The user may not always be entirely thrilled about having
>> to, say, validate a problematic record against external sources of
>> truth, but they usually know how to do it.  Database bugs aren't the
>> only way that databases become inaccurate.  If the database that they
>> use to keep track of land ownership in the jurisdiction where I live
>> says that two different people own the same piece of property,
>> somewhere there is a paper deed in a filing cabinet.  Fishing that out
>> to understand what happened may not be fun, but a DBA can explain that
>> problem to other people in the organization and those people can get
>> it fixed.  It's a problem, but it's fixable.
>>
>> On the other hand, if a heap tuple contains invalid infomask bits that
>> cause an error every time you read the page (this actually happened to
>> an EnterpriseDB customer!), the DBA can't tell other people how to fix
>> it and can't fix it personally either.  Instead, the DBA calls me.
>
> After reading this statement the ZFS filesystem pops into my mind. It 
> has protection built in against various problems (data degradation, 
> current spikes, phantom writes, etc.).
>
> For me this raises two questions:
>
> 1) Would the usage of ZFS prevent such errors?
>
> My feeling would say yes, but I have no idea how an invalid 
> infomask bit could occur.
>
> 2) Would it be possible to add such prevention to PostgreSQL?
>
> I know this could add a massive overhead, but if it's optional this 
> could be a fine thing?
PostgreSQL is very "ZFS-like" in its internals.  The problem was a bug 
in PostgreSQL that caused it to just write data to the wrong place.

Some vendors use ZFS under databases to provide very cool services such 
as backup snapshots, test snapshots and other such uses.  I think Joyent 
is one such vendor but I'm not 100% sure.

-Alfred



Re: Why we lost Uber as a user

From
Alfred Perlstein
Date:

On 8/2/16 10:02 PM, Mark Kirkwood wrote:
> On 03/08/16 02:27, Robert Haas wrote:
>>
>> Personally, I think that incremental surgery on our current heap
>> format to try to fix this is not going to get very far.  If you look
>> at the history of this, 8.3 was a huge release for timely cleanup of
>> dead tuples.  There was also significant progress in 8.4 as a result of
>> 5da9da71c44f27ba48fdad08ef263bf70e43e689.   As far as I can recall, we
>> then made no progress at all in 9.0 - 9.4.  We made a very small
>> improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but
>> that's pretty niche.  In 9.6, we have "snapshot too old", which I'd
>> argue is potentially a large improvement, but it was big and invasive
>> and will no doubt pose code maintenance hazards in the years to come;
>> also, many people won't be able to use it or won't realize that they
>> should use it.  I think it is likely that further incremental
>> improvements here will be quite hard to find, and the amount of effort
>> will be large relative to the amount of benefit.  I think we need a
>> new storage format where the bloat is cleanly separated from the data
>> rather than intermingled with it; every other major RDBMS works that
>> way.  Perhaps this is a case of "the grass is greener on the other
>> side of the fence", but I don't think so.
>>
> Yeah, I think this is a good summary of the state of play.
>
> The only other new db development to use a non-overwriting design like 
> ours that I know of was Jim Starkey's Falcon engine for (ironically) 
> MySQL 6.0. Not sure if anyone is still progressing that at all now.
>
> I do wonder if Uber could have successfully tamed dead tuple bloat 
> with aggressive per-table autovacuum settings (and if in fact they 
> tried), but as I think Robert said earlier, it is pretty easy to come 
> up with a high-update (or insert + delete) workload that makes for a 
> pretty ugly bloat component even with really aggressive autovacuuming.
I also wonder if they had used a "star schema", which to my understanding 
would mean multiple tables replacing the single table with multiple 
indexes, to work around the write amplification problem in PostgreSQL.

>
> Cheers
>
> Mark
>
>
>




Re: Why we lost Uber as a user

From
Alfred Perlstein
Date:

On 8/3/16 3:29 AM, Greg Stark wrote:
>
> Honestly the take-away I see in the Uber story is that they apparently
> had nobody on staff that was on -hackers or apparently even -general
> and tried to go it alone rather than involve experts from outside
> their company. As a result they misdiagnosed their problems based on
> prejudices, seeing what they expected to see rather than what the real
> problem was.
>
Agree strongly, but there are still lessons to be learned on the PostgreSQL side.

-Alfred



Re: Why we lost Uber as a user

From
Jim Nasby
Date:
Something I didn't see mentioned that I think is a critical point: last 
I looked, HOT standby (and presumably SR) replays full page writes. That 
means that *any* kind of corruption on the master is *guaranteed* to 
replicate to the slave the next time that block is touched. That's 
completely the opposite of trigger-based replication.

On 8/3/16 3:51 PM, Kevin Grittner wrote:
> Personally, I can't imagine running logical replication of
> supposedly matching sets of data without something equivalent.

I think it depends heavily on the replication solution. I ran londiste 
for 6+ years with no major issues, but of course there was at least one 
other major company running that. I also took the time to completely 
read all the source code; something that's a reasonable prospect with a 
few thousand lines of python. For streaming rep it's difficult just to 
draw the line at where the code is.

Ultimately, people really need to understand the trade-offs to the 
different solutions so they can make an informed decision on which ones 
(yes, plural) they want to use. The same can be said about pg_upgrade vs 
something else, and the different ways of doing backups.

Something I think a lot of folks fail to understand is the value of 
having a system that has simple technology in the mix. Keeping something 
like londiste running has a non-zero cost, but the day you discover 
corruption has replicated through your entire infrastructure you'll 
probably be REALLY happy you have it. Similarly, I always encourage 
people to run a weekly or monthly pg_dump if it's at all feasible... 
just to be safe.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: Why we lost Uber as a user

From
Craig Ringer
Date:
On 17 August 2016 at 08:36, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Something I didn't see mentioned that I think is a critical point: last I looked, HOT standby (and presumably SR) replays full page writes.

Yes, that's right, all WAL-based physical replication replays FPWs.

We could, at the cost of increased WAL size, retain both the original WAL buffer that triggered the FPW and the FPW page image. That's what wal_level = logical does in some cases. I'm not sure it's that compelling, though; it just introduces another redo path that can go wrong.
 
 
> Ultimately, people really need to understand the trade-offs to the different solutions so they can make an informed decision on which ones (yes, plural) they want to use. The same can be said about pg_upgrade vs something else, and the different ways of doing backups.

Right.

It's really bugging me that people are talking about "statement based" replication in MySQL as if it's just sending SQL text around. MySQL's statement based replication is a lot smarter than that, and in the actually-works-properly form it's a hybrid of row and statement based replication ("MIXED" mode). As I understand it, it lobs around something closer to parsetrees with some values pre-computed rather than SQL text where possible. It stores some computed values of volatile functions in the binlog and reads them from there rather than computing them again when running the statement on replicas, which is why AUTO_INCREMENT etc. works. It also falls back to row based replication where necessary for correctness. Even then it has a significant list of caveats, but it's pretty damn impressive. I didn't realise how clever the hybrid system was until recently.

I can see it being desirable to do something like that eventually as an optimisation to logical decoding based replication: where we can show that the statement is safe or make it safe by doing things like evaluating and substituting volatile function calls, xlog a modified parsetree with oids changed to qualified object names etc., send that when decoding, and execute that on the downstream(s). If there's something we can't show to be safe then replay the logical rows instead. That's way down the track though; I think it's more important to focus on completing logical row-based replication to the point where we handle table rewrites seamlessly and it "just works" first.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Why we lost Uber as a user

From
Greg Stark
Date:
On Wed, Aug 17, 2016 at 1:36 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Something I didn't see mentioned that I think is a critical point: last I
> looked, HOT standby (and presumably SR) replays full page writes. That means
> that *any* kind of corruption on the master is *guaranteed* to replicate to
> the slave the next time that block is touched. That's completely the
> opposite of trigger-based replication.

Yes, this is exactly what it should be doing and exactly why it's
useful. Physical replication accurately replicates the data from the
master including "corruption" whereas a logical replication system
will not, causing divergence and possible issues during a failover.

Picture yourself as Delta: you have a fire in your data centre and go
to fail over to your secondary site. Your DBAs inform you that the
secondary site has "fixed" some corruption that you were unaware of
and that wasn't causing any issues, and now, in the middle of the
business crisis, is when you're going to need to spend time identifying
and repairing the problem, because your business logic has suddenly
started running into problems.

Physical replication tries to solve the same use cases as physical
backups. They both provide you with exactly what you had prior to the
recovery. No more or less. That's what you want when recovering from a
disaster.

-- 
greg



Re: Why we lost Uber as a user

From
Bruce Momjian
Date:
On Wed, Aug 17, 2016 at 01:27:18PM +0800, Craig Ringer wrote:
> It's really bugging me that people are talking about "statement based"
> replication in MySQL as if it's just sending SQL text around. MySQL's statement
> based replication is a lot smarter than that, and in the
> actually-works-properly form it's a hybrid of row and statement based
> replication ("MIXED" mode). As I understand it it lobs around something closer
> to parsetrees with some values pre-computed rather than SQL text where
> possible. It stores some computed values of volatile functions in the binlog
> and reads them from there rather than computing them again when running the
> statement on replicas, which is why AUTO_INCREMENT etc. works. It also falls
> back to row based replication where necessary for correctness. Even then it has
> a significant list of caveats, but it's pretty damn impressive. I didn't
> realise how clever the hybrid system was until recently.
> 
> I can see it being desirable to do something like that eventually as an
> optimisation to logical decoding based replication: where we can show that the
> statement is safe or make it safe by doing things like evaluating and
> substituting volatile function calls, xlog a modified parsetree with oids
> changed to qualified object names etc., send that when decoding, and execute
> that on the downstream(s). If there's something we can't show to be safe then
> replay the logical rows instead. That's way down the track though; I think it's
> more important to focus on completing logical row-based replication to the
> point where we handle table rewrites seamlessly and it "just works" first.

That was very interesting, and good to know.  I assume it also covers
concurrent activity issues which I wrote about in this thread, e.g.

> I saw from the Uber article that they weren't going to per-row logical
> replication but _statement_ replication, which is very hard to do
> because typical SQL doesn't record what concurrent transactions
> committed before a new statement's transaction snapshot is taken, and
> doesn't record lock order for row updates blocked by concurrent activity
> --- both of which affect the final result from the query.

I assume they can do SQL-level replication when there is no other
concurrent activity on the table, and row-based in other cases?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Why we lost Uber as a user

From
Craig Ringer
Date:
On 17 August 2016 at 21:35, Bruce Momjian <bruce@momjian.us> wrote:
 

>> I saw from the Uber article that they weren't going to per-row logical
>> replication but _statement_ replication, which is very hard to do
>> because typical SQL doesn't record what concurrent transactions
>> committed before a new statement's transaction snapshot is taken, and
>> doesn't record lock order for row updates blocked by concurrent activity
>> --- both of which affect the final result from the query.

> I assume they can do SQL-level replication when there is no other
> concurrent activity on the table, and row-based in other cases?

I don't know, but wouldn't want to assume that. A quick search suggests they probably define that away as nondeterministic behaviour that's allowed to cause master/replica differences, but no time to look deeply.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Why we lost Uber as a user

From
Simon Riggs
Date:
On 17 August 2016 at 12:19, Greg Stark <stark@mit.edu> wrote:
> On Wed, Aug 17, 2016 at 1:36 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> Something I didn't see mentioned that I think is a critical point: last I
>> looked, HOT standby (and presumably SR) replays full page writes. That means
>> that *any* kind of corruption on the master is *guaranteed* to replicate to
>> the slave the next time that block is touched. That's completely the
>> opposite of trigger-based replication.
>
> Yes, this is exactly what it should be doing and exactly why it's
> useful. Physical replication accurately replicates the data from the
> master including "corruption" whereas a logical replication system
> will not, causing divergence and possible issues during a failover.

Yay! Completely agree.

Physical replication is what's used by DRBD and all other block-level HA
solutions, and also by other databases, such as Oracle.

Corruption on the master would often cause errors that would prevent
writes and therefore those changes wouldn't even be made, let alone be
replicated.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why we lost Uber as a user

From
Jim Nasby
Date:
On 8/17/16 2:51 PM, Simon Riggs wrote:
> On 17 August 2016 at 12:19, Greg Stark <stark@mit.edu> wrote:
>> On Wed, Aug 17, 2016 at 1:36 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>> Something I didn't see mentioned that I think is a critical point: last I
>>> looked, HOT standby (and presumably SR) replays full page writes. That means
>>> that *any* kind of corruption on the master is *guaranteed* to replicate to
>>> the slave the next time that block is touched. That's completely the
>>> opposite of trigger-based replication.
>>
>> Yes, this is exactly what it should be doing and exactly why it's
>> useful. Physical replication accurately replicates the data from the
>> master including "corruption" whereas a logical replication system
>> will not, causing divergence and possible issues during a failover.
>
> Yay! Completely agree.
>
> Physical replication, as used by DRBD and all other block-level HA
> solutions, and also used by other databases, such as Oracle.
>
> Corruption on the master would often cause errors that would prevent
> writes and therefore those changes wouldn't even be made, let alone be
> replicated.

My experience has been that you discover corruption after it's already 
safely on disk, and more than once I've been able to recover by using 
data on a londiste replica.

As I said originally, it's critical to understand the different 
solutions and the pros and cons of each. There is no magic bullet.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: Why we lost Uber as a user

From
Merlin Moncure
Date:
On Wed, Aug 17, 2016 at 5:18 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 8/17/16 2:51 PM, Simon Riggs wrote:
>> On 17 August 2016 at 12:19, Greg Stark <stark@mit.edu> wrote:
>>> Yes, this is exactly what it should be doing and exactly why it's
>>> useful. Physical replication accurately replicates the data from the
>>> master including "corruption" whereas a logical replication system
>>> will not, causing divergence and possible issues during a failover.
>>
>>
>> Yay! Completely agree.
>>
>> Physical replication, as used by DRBD and all other block-level HA
>> solutions, and also used by other databases, such as Oracle.
>>
>> Corruption on the master would often cause errors that would prevent
>> writes and therefore those changes wouldn't even be made, let alone be
>> replicated.
>
>
> My experience has been that you discover corruption after it's already
> safely on disk, and more than once I've been able to recover by using data
> on a londiste replica.
>
> As I said originally, it's critical to understand the different solutions
> and the pros and cons of each. There is no magic bullet.

Data point: in the half dozen or so cases where I've experienced corruption
on replicated systems, in all cases but one the standby was clean.  The
'unclean' case was actually an 8.2 warm standby; the source of the corruption
was a very significant bug where prepared statements would write back
corrupted data if the table definitions changed under the statement
(fixed in 8.3).  In that particular case the corruption was very
unfortunately quite widespread and passed directly along to the
standby server.  This bug nearly cost us a user as well, although not
nearly so famous as Uber :-).

In the few modern cases I've seen I've not been able to trace it back
to any bug in postgres (in particular multixact was ruled out) and
I've chalked it up to media or (more likely I think) filesystem
problems in the face of a -9 reset.

merlin