Thread: Postgres vs. MySQL - style differences?
I'm still new to databases (though not to computing in general), and I'll admit that I haven't ever really used MySQL. But listening in on the various conversations comparing MySQL to Postgres, one recurring theme I've seen is that accessing the different databases requires different styles, and that many of the problems in switching from one database to the other are mainly a matter of problems with the style. Especially when the style is encoded in software.

Take, for example, inserting large numbers of records into a table. As someone who was raised on Postgres (so to speak), when loading large numbers (more than 10 or so) of records into a table I immediately start reaching for a copy command (see the P.S. below for a sketch). And our code base is littered with copies: from simple programs that, given a CSV file and a table name, burst-load a table, to complicated libraries that use multithreaded queues and switch between copies and transactional inserts depending upon how much data needs to be copied. But someone used to MySQL would just start firing off lone, non-transactional insert statements, and their code base wouldn't contain such programs or libraries.

Or take transactions. Our code operates on the assumption that the database handles concurrency just fine, and that the code can feel free to haul off and do whatever it likes (insert, delete, update, or select whatever data it feels like, whenever it feels like) without regard to whoever else is accessing the same data. Don't worry, the database will sort it all out for you. I'd be terrified to take this code base to a non-transactional database, because I have no idea where the synchronization issues are; they haven't been a performance issue, so no thought whatsoever has been given to them beyond "let the database handle it". But I don't think there are many, or that they'd be that hard to code around; if the code base had been written from the get-go knowing that we couldn't rely on the database for concurrency, adjustments could have been made. As a side note, given a choice between having to specially handle inserts vs. specially handle synchronization, I'll take specially handling inserts any day.

Or take multi-table joins. I used to laugh at the idea of 27-table joins, until I found one in our code. I was doing a complicated query against a view, which was joining together several other views, which were joining together several other views, which were joining together views which were joining together tables. No, I'm not exaggerating for comedic effect; the views really did go four layers deep. Why? Because Postgres just handled it, so there was no real pushback against making views of views of views of views. I only noticed because the complicated query was somewhat slow (not intolerably slow, just somewhat slow) and I went investigating why. If we were working with a database that didn't handle multi-table joins nearly so well, we probably would have redesigned the table structure at some point to limit the depth of the views and the number of tables in the query.

My point here is to try to explain why moving code between MySQL and Postgres, in either direction, always seems fraught with difficulty, and why the new database always seems to be of less value than the old database. If you've already spent the time and effort to avoid synchronization problems outside the database, and to limit the number of tables you're joining in your queries, then not having to do these things has zero value.
On the other hand, if you haven't already special-cased inserts, having to now go back and special-case inserts is a cost. Likewise, if you've already paid the cost to use copy instead of inserts, not having to do so is a minor advantage; but if you have given little thought to limiting the number of tables you join (barring performance problems), or to synchronization issues (beyond an occasional begin and commit), you're looking at a large cost on switching.

Just my $0.02.

Brian
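P.S. For the unfamiliar, a minimal sketch of the burst-load pattern I mean, with invented table and file names:

    -- server-side: load a whole CSV file in one statement
    COPY line_items FROM '/tmp/line_items.csv' WITH CSV;

    -- or from psql, reading the file on the client side:
    \copy line_items FROM 'line_items.csv' WITH CSV

A single copy moves the entire file in one command and one transaction, which is why it beats firing off one insert statement per row.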
On Mon, May 28, 2007 at 10:29:59AM -0400, Brian Hurt wrote:
> different styles, and that many of the problems in switching from one
> database to the other are mainly a matter of problems with the style.
> Especially when the style is encoded in software.

Right. This is not only true of PostgreSQL-MySQL comparisons, though. Even with databases that are rather more compliant with the ANSI specifications than MySQL, there are plenty of details that you have to pay attention to. (MySQL is a special case, really, because in its early history it actively disparaged many of the features that non-toy databases took to be obvious goods, like referential integrity and ACID.)

I would argue that IBM tries pretty hard to conform to the ANSI SQL spec in DB2, for instance. Yet moving from Postgres to DB2 (or conversely) would not be trivial, if you wanted to preserve your optimizations. DB2's approach to concurrency causes what is called lock escalation. In most cases, it works well, and the DB2 people are _really smart_ and have implemented impressive algorithms to handle this. But if you're used to MVCC, the chances are pretty good that under DB2 you'll eventually run into lock starvation, and find that you have strange occasions of suddenly near-stopped performance unless you go through your application and port it very carefully.

This is why many applications that have some portability layer for the database seem to be so awful at using the database: they can't use any of the strengths of any of the underlying databases, because those strengths are usually the thing that some other database is really bad at. So you often end up with an application that isn't truly stellar at anything, and your database system has to be oversized to perform acceptably. (Note that there's nothing intrinsically wrong with this approach; you've just optimised your code for portability at the expense of performance.)

SQL really gives minimal, rather than maximal, portability. It's nice that PostgreSQL tends to conform with SQL, though, because at least if you did have to move in or (less likely :) out of it, your reference point is fairly well-defined.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The whole tendency of modern prose is away from concreteness.
    --George Orwell
Brian Hurt wrote:
>
> Take, for example, inserting large numbers of records into a table.
> As someone who was raised on Postgres (so to speak), when loading
> large numbers (more than 10 or so) of records into a table I
> immediately start reaching for a copy command. And our code base is
> littered with copies: from simple programs that, given a CSV file and
> a table name, burst-load a table, to complicated libraries that use
> multithreaded queues and switch between copies and transactional
> inserts depending upon how much data needs to be copied. But someone
> used to MySQL would just start firing off lone, non-transactional
> insert statements, and their code base wouldn't contain such programs
> or libraries.

Just to make sure that nobody gets confused here: MySQL has LOAD DATA INFILE as their COPY alternative, which performs marvelously.

regards,
Lukas
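For reference, the rough MySQL equivalent of a Postgres COPY; the exact options vary, and the table and file names here are again invented:

    LOAD DATA INFILE '/tmp/line_items.csv'
    INTO TABLE line_items
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

Like COPY, it loads the whole file in a single statement rather than one INSERT per row.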
Andrew Sullivan wrote:
> This is why many applications that have some portability layer for
> the database seem to be so awful at using the database: they can't
> use any of the strengths of any of the underlying databases, because
> those strengths are usually the thing that some other database is
> really bad at. So you often end up with an application that isn't
> truly stellar at anything, and your database system has to be
> oversized to perform acceptably. (Note that there's nothing
> intrinsically wrong with this approach; you've just optimised your
> code for portability at the expense of performance.)

Yup, they tend to rely heavily on data integrity management in the middle tier. Which, by the way, is not totally crazy, because you need most of the data integrity rules on the frontend anyway, to generate proper GUIs. Actually, in non-object-relational RDBMSes (which are the norm; PostgreSQL is special), you can never do your full integrity management inside the database. Which is all kind of OK, as long as you never mess with the database through anything but your single middle-tier implementation.

My point is, there are solid reasons for writing applications that use the database as a mostly stupid data store with very efficient retrieval, and these applications tend to also perform pretty well regardless of what RDBMS you stick underneath.

regards,
Lukas
On Mon, May 28, 2007 at 09:37:53PM +0200, Lukas Kahwe Smith wrote:
> inside the database. Which is all kind of OK, as long as you never
> mess with the database through anything but your single middle-tier
> implementation.

That's a pretty big if. It's part of the reason I really dislike the idea of moving integrity work out of the db and into the application. It's not only ensuring that the data gets updated only by the application that you need to worry about; you also have to make sure that all the application programmers (and there will be more than one, for any application that lasts any length of time or does anything non-trivial) do all the integrity stuff in the same way. And they rarely do. The whole point of centralising your data is not only to share it; it's to share it in a consistent way. Otherwise, why not just dump it on the filesystem?

> My point is, there are solid reasons for writing applications that
> use the database as a mostly stupid data store with very efficient
> retrieval, and these applications tend to also perform pretty well
> regardless of what RDBMS you stick underneath.

RDBMSes are often not actually very efficient retrieval engines. If that's what you need, straight filesystem access is a better bet. The _R_ in RDBMS is there for a reason, and if you're not going to use your database in a relational way, then you don't need an RDBMS. You _can_ use the RDBMS as a mostly stupid data store, and it likely won't break anything. But it seems a waste of computing cycles to me. Often, the "stupid datastore" systems I've seen look like they've used SQL because that was the hammer the programmer had, rather than the right tool for the job. (This is optimising for developer learning curve rather than anything else. Also a legitimate trade-off, if you want to make it, so long as you know you're doing so.)

A

--
Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly
what you told them to. That actually seems sort of quaint now.
    --J.D. Baldwin
Andrew Sullivan wrote:
> RDBMSes are often not actually very efficient retrieval engines.

Yeah, well, but it is more convenient than writing your own :)

regards,
Lukas
On May 28, 2007, at 2:37 PM, Lukas Kahwe Smith wrote:
> Andrew Sullivan wrote:
>> This is why many applications that have some portability layer for
>> the database seem to be so awful at using the database: they can't
>> use any of the strengths of any of the underlying databases, because
>> those strengths are usually the thing that some other database is
>> really bad at. So you often end up with an application that isn't
>> truly stellar at anything, and your database system has to be
>> oversized to perform acceptably. (Note that there's nothing
>> intrinsically wrong with this approach; you've just optimised your
>> code for portability at the expense of performance.)
>
> Yup, they tend to rely heavily on data integrity management in the
> middle tier. Which, by the way, is not totally crazy, because you
> need most of the data integrity rules on the frontend anyway, to
> generate proper GUIs. Actually, in non-object-relational RDBMSes
> (which are the norm; PostgreSQL is special), you can never do your
> full integrity management inside the database.

Can you give an example of that? Because I can't think of one, unless you're talking about integrity that has to lie outside the database (i.e., if you're storing filesystem locations in the database).

> Which is all kind of OK, as long as you never mess with the database
> through anything but your single middle-tier implementation.

I do a lot of work on systems where it would be impractical or impossible to do everything through middleware. The other issue is that you have to be very, very careful how you do integrity checking in the middleware if you want to avoid things like race conditions. The database not only gives that to you for free, but it's also extremely efficient at it.

> My point is, there are solid reasons for writing applications that
> use the database as a mostly stupid data store with very efficient
> retrieval, and these applications tend to also perform pretty well
> regardless of what RDBMS you stick underneath.

Perhaps if you carefully hand-code your middleware... my experience is that things like Hibernate tend to destroy performance, because unless you really, really know it well you end up with some pretty bad database access methods. It doesn't have to be that way, but it often is.

Put another way: if a database is a foot-gun when it comes to performance, middleware is a foot-bazooka. :)

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
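To make the race condition concrete, a hedged sketch with invented table and column names: two clients that each do a middleware-style check-then-insert can both pass the check before either inserts:

    -- application-level uniqueness check: racy under concurrency
    SELECT count(*) FROM users WHERE email = 'a@example.com';  -- both clients see 0
    INSERT INTO users (email) VALUES ('a@example.com');        -- both succeed: duplicate rows

    -- declaring the rule in the database closes the window:
    ALTER TABLE users ADD CONSTRAINT users_email_key UNIQUE (email);
    -- now the slower of the two concurrent INSERTs fails with a unique violation

This is the kind of check the database does "for free", atomically and without extra round trips.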
On May 28, 2007, at 9:29 AM, Brian Hurt wrote:
> I'm still new to databases (though not to computing in general), and
> I'll admit that I haven't ever really used MySQL. But listening in on
> the various conversations comparing MySQL to Postgres, one recurring
> theme I've seen is that accessing the different databases requires
> different styles, and that many of the problems in switching from one
> database to the other are mainly a matter of problems with the style.
> Especially when the style is encoded in software.
>
> Take, for example, inserting large numbers of records into a table.
> As someone who was raised on Postgres (so to speak), when loading
> large numbers (more than 10 or so) of records into a table I
> immediately start reaching for a copy command. And our code base is
> littered with copies: from simple programs that, given a CSV file and
> a table name, burst-load a table, to complicated libraries that use
> multithreaded queues and switch between copies and transactional
> inserts depending upon how much data needs to be copied. But someone
> used to MySQL would just start firing off lone, non-transactional
> insert statements, and their code base wouldn't contain such programs
> or libraries.

And that someone would have horrid performance in *any* ACID RDBMS. This is one of the things I hate about MySQL: it teaches people how to abuse an RDBMS. Of course, that came around and bit them in the rear when they put InnoDB in: since it's (mostly) ACID, inserts suddenly started to suck. So to make their users happy they then added a bunch of knobs that let you dial back the ACIDity.

> Or take multi-table joins. I used to laugh at the idea of 27-table
> joins, until I found one in our code. I was doing a complicated query
> against a view, which was joining together several other views, which
> were joining together several other views, which were joining together
> views which were joining together tables. No, I'm not exaggerating for
> comedic effect; the views really did go four layers deep. Why? Because
> Postgres just handled it, so there was no real pushback against making
> views of views of views of views. I only noticed because the
> complicated query was somewhat slow (not intolerably slow, just
> somewhat slow) and I went investigating why. If we were working with a
> database that didn't handle multi-table joins nearly so well, we
> probably would have redesigned the table structure at some point to
> limit the depth of the views and the number of tables in the query.

I have a story about the small-time mentality that MySQL and some other tools generate. Someone posted what he didn't like about Ruby on Rails on a Joel on Software forum. One of the beefs was that Rails won't pick up RI information from the database, because the author of Rails has this harebrained idea that the database should be nothing but a stupid data dump (if that's really what he wanted, why the heck didn't he just use flat files?). Of course, this resulted in a big flamewar between what were largely professional DBAs (from various RDBMS backgrounds) and the Rails fanboys. One of the threads was talking about what features you need when you have large numbers of tables... to which one of the Rails fans replied: "I guess I could see that if you had a *really* complex database, with, say, 30 tables."

> My point here is to try to explain why moving code between MySQL and
> Postgres, in either direction, always seems fraught with difficulty,
> and why the new database always seems to be of less value than the old
> database. If you've already spent the time and effort to avoid
> synchronization problems outside the database, and to limit the number
> of tables you're joining in your queries, then not having to do these
> things has zero value. On the other hand, if you haven't already
> special-cased inserts, having to now go back and special-case inserts
> is a cost. Likewise, if you've already paid the cost to use copy
> instead of inserts, not having to do so is a minor advantage; but if
> you have given little thought to limiting the number of tables you
> join (barring performance problems), or to synchronization issues
> (beyond an occasional begin and commit), you're looking at a large
> cost on switching.

Switching databases always has some kind of cost, because none of them work the same. The problem with MySQL is that it has the ability to act *very* different from any other database. Shunning ANSI SQL is one example, but there are things like silently truncating data, mishandling NULLs, etc. (Google "mysql gotchas" for a good starting list).

Similarly, moving from DB2, MSSQL, or Sybase to PostgreSQL or Oracle is easier than the other way around, because PostgreSQL and Oracle are MVCC, which means you spend *far* less time worrying about locking issues. With non-MVCC databases, you have to be very conscious of what the access pattern is for each table and make sure not to break that.

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
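The silent-truncation gotcha, for example, looks roughly like this; a sketch assuming MySQL's historical non-strict mode, which was the default at the time:

    CREATE TABLE t (v VARCHAR(5));
    INSERT INTO t VALUES ('hello world');
    -- MySQL, non-strict mode: stores 'hello' and raises only a warning
    -- PostgreSQL: ERROR: value too long for type character varying(5)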
Jim Nasby wrote:
> I do a lot of work on systems where it would be impractical or
> impossible to do everything through middleware. The other issue is
> that you have to be very, very careful how you do integrity checking
> in the middleware if you want to avoid things like race conditions.
> The database not only gives that to you for free, but it's also
> extremely efficient at it.

Data integrity in the database just seems to make sense on the face of it. Consider enterprise databases, with many different applications accessing any given database, some of which reside outside the hosting enterprise. It is unreasonable, if not impossible, to route all such applications through a single (choke-)point on the way to the database, another single point. As Jim says, enterprise-scale DBMSes are good at handling this sort of load.

> Perhaps if you carefully hand-code your middleware... my experience
> is that things like Hibernate tend to destroy performance, because
> unless you really, really know it well you end up with some pretty
> bad database access methods. It doesn't have to be that way, but it
> often is.

I've seen similar difficulties with OO-relational frameworks. The trap of such a framework is that getting to "really, really know it well" requires at least as much database expertise as working without the framework, plus you have to know the framework even better. Even then you still often "end up with some pretty bad database access"; the best case is that you get a subset of the database system's power for an increase in application complexity. This is the irony lost on the proponents of these frameworks: "If you work really, really hard, it'll save you effort, except in development or maintenance."

--
Lew
On Tue, May 29, 2007 at 09:18:23AM -0400, Lew wrote:
> still often "end up with some pretty bad database access"; the best
> case is that you get a subset of the database system's power for an
> increase in application complexity. This is the irony lost on the
> proponents of these frameworks: "If you work really, really hard,
> it'll save you effort, except in development or maintenance."

Well, maybe not. Remember, there are lots of ways to optimise your code design. One is to design for the tools you have and know how to use. If you have a bunch of developers who know Beans (sorry, I couldn't resist), and you have six weeks to deliver functionality that oughta take six months to do properly, then you just throw together the things you know how to do. This makes for much less good code, of course, and will cost in the long run. But for immediate-term problems, it might be a good trade-off. (There are plenty of companies who then never do step two, which is to throw all that away and do the job properly. But that's just bad management, and is not an argument that the original trade-off was necessarily a bad one.)

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The plural of anecdote is not data.
    --Roger Brinner
Jim Nasby wrote:
>> Yup, they tend to rely heavily on data integrity management in the
>> middle tier. Which, by the way, is not totally crazy, because you
>> need most of the data integrity rules on the frontend anyway, to
>> generate proper GUIs. Actually, in non-object-relational RDBMSes
>> (which are the norm; PostgreSQL is special), you can never do your
>> full integrity management inside the database.
>
> Can you give an example of that? Because I can't think of one, unless
> you're talking about integrity that has to lie outside the database
> (i.e., if you're storing filesystem locations in the database).

Yes, but also more trivial things, like "is this a valid email address" (note I said this is not the case with object-relational databases like PostgreSQL). Checking the format of an email address is just not something I would want to do with a check constraint, especially if I had to replicate that check constraint on every relevant column.

> Perhaps if you carefully hand-code your middleware... my experience
> is that things like Hibernate tend to destroy performance, because
> unless you really, really know it well you end up with some pretty
> bad database access methods. It doesn't have to be that way, but it
> often is.

Yeah, I think at the very least ORMs make it less obvious when you are calling something expensive. It all happens sort of magically, and you do not really know when something is going to cost you and, if so, how much.

regards,
Lukas
On Tue, May 29, 2007 at 07:26:51PM +0200, Lukas Kahwe Smith wrote:
> Jim Nasby wrote:
>
>>> Yup, they tend to rely heavily on data integrity management in the
>>> middle tier. Which, by the way, is not totally crazy, because you
>>> need most of the data integrity rules on the frontend anyway, to
>>> generate proper GUIs. Actually, in non-object-relational RDBMSes
>>> (which are the norm; PostgreSQL is special), you can never do your
>>> full integrity management inside the database.
>>
>> Can you give an example of that? Because I can't think of one,
>> unless you're talking about integrity that has to lie outside the
>> database (i.e., if you're storing filesystem locations in the
>> database).
>
> Yes, but also more trivial things, like "is this a valid email
> address"

This is where PostgreSQL really shines. You can create a DOMAIN constraint that checks whether an address is valid, all the way up to checking whether it's accepted for delivery, although I'd imagine you wouldn't want to do that last part. You can then use that DOMAIN all over your code.

> Yeah, I think at the very least ORMs make it less obvious when you
> are calling something expensive. It all happens sort of magically,
> and you do not really know when something is going to cost you and,
> if so, how much.

All this magic means that you're taking on a larger hassle, more subtle bugs, etc., etc., for no long-term benefit.

Cheers,
D

--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate
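A minimal version of that email DOMAIN might look like the following; the regular expression is deliberately simplistic, just for illustration:

    CREATE DOMAIN email_address AS text
        CHECK (VALUE ~ '^[^@[:space:]]+@[^@[:space:]]+[.][^@[:space:]]+$');

    CREATE TABLE users (
        id    serial PRIMARY KEY,
        email email_address NOT NULL
    );

Every column declared as email_address enforces the check automatically, so the rule lives in one place instead of being repeated as a per-column constraint.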
David Fetter wrote:
> On Tue, May 29, 2007 at 07:26:51PM +0200, Lukas Kahwe Smith wrote:
>> Jim Nasby wrote:
>>
>>>> Yup, they tend to rely heavily on data integrity management in the
>>>> middle tier. Which, by the way, is not totally crazy, because you
>>>> need most of the data integrity rules on the frontend anyway, to
>>>> generate proper GUIs. Actually, in non-object-relational RDBMSes
>>>> (which are the norm; PostgreSQL is special), you can never do your
>>>> full integrity management inside the database.
>>> Can you give an example of that? Because I can't think of one,
>>> unless you're talking about integrity that has to lie outside the
>>> database (i.e., if you're storing filesystem locations in the
>>> database).
>> Yes, but also more trivial things, like "is this a valid email
>> address"
>
> This is where PostgreSQL really shines. You can create a DOMAIN
> constraint that checks whether an address is valid, all the way up to
> checking whether it's accepted for delivery, although I'd imagine you
> wouldn't want to do that last part. You can then use that DOMAIN all
> over your code.

I know! I know!

/me hugs PostgreSQL

Actually, while thinking of an example for my previous post, I simply used the "email validation" example from Robert's PL/php talk.

regards,
Lukas
Andrew Sullivan wrote:
> Remember, there are lots of ways to optimise your code design. One is
> to design for the tools you have and know how to use. If you have a
> bunch of developers who know Beans (sorry, I couldn't resist), and
> you have six weeks to deliver functionality that oughta take six
> months to do properly, then you just throw together the things you
> know how to do. This makes for much less good code, of course, and
> will cost in the long run. But for immediate-term problems, it might
> be a good trade-off. (There are plenty of companies who then never do
> step two, which is to throw all that away and do the job properly.
> But that's just bad management, and is not an argument that the
> original trade-off was necessarily a bad one.)

I completely agree. And I laughed at your pun. My post had to do with the ideal; yours focuses on the pragmatic. If your one-off approach employs standard patterns (like JavaBeans) and programmers steeped in a culture of excellence, the refactoring to a full-blown architecture is much less painful. My own practices are slightly schizoid: I aim for technical elegance but frequently make compromises for expediency.

--
Lew