Re: Dreaming About Redesigning SQL - Mailing list pgsql-hackers
From | Bob Badour
---|---
Subject | Re: Dreaming About Redesigning SQL
Date |
Msg-id | 0J-dncfSRf9EJQiiU-KYuA@golden.net
In response to | Re: Dreaming About Redesigning SQL (Christopher Browne <cbbrowne@acm.org>)
List | pgsql-hackers
"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message news:bn4cca$dj0$1@nyytiset.pp.htv.fi... > Anthony W. Youngman wrote: > > >In article <3F94BCBB.7030001@atbusiness.com>, Lauri Pietarinen > ><lauri.pietarinen@atbusiness.com> writes > > > > > >>>Okay. Give me a FORMULA that returns a time in seconds for your query. > >>> > >>>Let's assume I want to print a statement of how many invoices were sent > >>>to a customer, along with various details of those invoices. My invoice > >>>file is indexed by company/month, and we can reasonably assume that the > >>>time taken to produce the statement is infinitesimal compared to the > >>>time taken to retrieve the invoice data from disk. For MV > >>> > >>>T = (2 + N) * ST * 1.05 > >>> > >>>Where T is the time taken to produce the report, N is the number of > >>>invoices, and ST is the hard disk seek time. > >>> > >>> > >>> > >>First of all it is important to note that an important component of all modern > >>SQL-DBMS's is > >>the buffer pool (or cache) meaning that in a reasonably well tuned database you > >>get very few > >>disk I/O's, even when *writing* data into tables. > >> > >> > > > >Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and > >MV have the same amount of RAM to cache in - i.e. *not* *much*. I did > >say the spec said "extract maximum performance from the hardware > >available". > > > So what's wrong with gettng a machine with lots of memory? How much > does 2G of > memory for an Intel-box cost now a days? Is this some kind of new > ultimate sport, trying > to get along with as little memory as possible? > > >You're assuming that you can throw hardware at the problem - fine, but > >that's not always possible. You might have already maxed out the ram, > >you might have a "huge" database, you might be sharing your db server > >with other programs (BIND really likes to chew up every available drop > >of ram, doesn't it :-). > > > >I'm not saying that you shouldn't throw hardware at it, but what if you > >can't? > > > > > >>SQL-DBMS's also are very clever at using indexes, i.e. if they can find all > >>necessary data > >> > >> > >>from an index it will not even look at the table, so to speak. > > > >Same with MV > > > > > >>And, even when presuming conservatively that there is no data in cache, > >>depending on how > >>the data is clustered, you will get more than one row/disk read (= 8K in most(?) > >>systems). > >> > >> > > > >Same with MV > > > > > >>I could now denormalise OrderDetail so that it contains cust_id also > >>and cluster by cust_id > >>(might cause you trouble down the road, if you can change the customer > >>of an order), in which case, with 3 I/O's I would get > >>- 8 customer rows > >>- 16 order rows > >>- 24 order detail rows (which would all apply to one customer) > >> > >>Now the amout of I/O's would depend on how many detail rows > >>we have per customer. > >> > >>And, of course, because we are using sequential prefetch, we would be > >>getting more than one I/O block (8?, 16?) per disk seek, so it's a hard > >>comparison to > >>make but I suspect that it would about equal your example. > >> > >> > > > >Except my example was an *average* case, and yours is a *best* case. Oh, > >and my data is still normalised - I haven't had to denormalise it! AND I > >haven't run an optimiser over it :-) > > > Are you hiding your optimiser behind the curtain? ;-) > > Well, if it is normalised, how easy is it for you to change the > customer_id of an order? 
> Anyway, if we stick to your example, and even if we don't denormalise
> and instead use e.g. the clustering features of Oracle, as Bob pointed
> out we are getting at most the same number of I/Os. So, the answer to
> your question: our formula is at least as good as yours.

Actually, Bob pointed out we are getting at most 12.5% as many disk head
movements or I/Os. I'll take an 87.5% improvement any day.

> >>Now, that was a *conservative* estimate, and we assumed that we did not
> >>have any rows lying around in the (global!) cache. As the size of the
> >>cache grows in proportion to the size of the total database, we can
> >>assume less and less disk I/O.
> >
> >You're relying on the hardware to bale you out :-) We can do the same!
>
> Well why don't you?

We achieved 8 times the performance with exactly the same hardware. What
the hell is this idiot talking about us relying on hardware? He is a moron.
You will do everyone a favour if you just bounce him off the bottom of your
killfile.

> >>Note also that the cache can be configured in many ways: you can put
> >>different tables (or indexes) in different caches, and even change the
> >>size of a cache on the fly (you might want a bigger cache during the
> >>evening and night when your batch programs are running), so you can rig
> >>your system to favour certain types of queries.
> >>
> >>I haven't even gone into the topic of using thick indexes, so that
> >>table access can be avoided entirely (= we are reading into memory only
> >>the interesting columns).
> >>
> >>Now, in your example, what if the product department comes along and
> >>wants a report of sales per product? What would your formula be in that
> >>case?
> >
> >I'm not quite sure what you're trying to do. I'll assume you want a
> >report of all invoices which refer to a given product. Assuming I've got
> >the relevant indices defined, I can simply read a list of invoices from
> >the product code index, a second list of invoices from the month index,
> >and do an intersect of the two lists.
>
> I want a list of all products with their corresponding total sales, read
> from order detail, e.g.
>
> Hammer 10000$
> Nail    5000$
> Screw   1200$
>
> How many disk reads (or head movements)?
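(For concreteness, the report Lauri is asking for is a single aggregate
over the detail table. A sketch, again with invented names - a hypothetical
order_detail with product and amount columns:

    -- Hypothetical schema, for illustration only:
    CREATE TABLE order_detail (order_id integer,
                               product  text,
                               amount   numeric);

    -- Total sales per product, straight off order_detail:
    SELECT product, SUM(amount) AS total_sales
    FROM order_detail
    GROUP BY product
    ORDER BY total_sales DESC;

With a "thick" covering index on (product, amount), a DBMS can answer this
from the index alone, never touching the table - the point Lauri makes
above.)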
> >So again, T = (2+N) * ST * 1.05, where N is the number of invoices that
> >reference that product. And now ALL the invoice data has been retrieved
> >from disk to RAM ...
>
> >>And: what if I was just reading customer data? Would the same formula
> >>apply (= (2+N)*ST*1.05)?
> >
> >Nope. If I understand you correctly, you want attributes that belong to
> >the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
> >way, the billing and/or invoice address (for example) are invoice
> >attributes, not company attributes.)
>
> No, I want you to give me a list of all your customers. How many disk
> reads?
>
> >>>But as I understand relational theory, such a question is completely
> >>>outside the scope of the theory. Seeing as it tries to divorce the
> >>>database logic from the practical implementation ...
> >>
> >>The theory, indeed, does not say anything about buffer pools, but by
> >>decoupling logic from implementation we leave the implementor (the
> >>DBMS) free to do as it sees fit.
> >>
> >>As DBMS technology advances, we get faster systems without having to
> >>change our programs.
> >
> >But with MV, if our database is too large for current technology, we
> >kick the shit out of relational for speed ...

This idiot is a fucking fool. He makes untrue assertions and actually
believes them.

> >Don't forget. You've already said that, if nothing is cached, my average
> >case exceeds your best.

No, actually, we did not. We already said that, assuming identical hardware
and caching, our average case exceeds his best case by a factor of eight.
If you are going to engage these ridiculously ignorant and stupid Pick
zealots, you must do a better job of identifying the horseshit they spout
with just about every word.

> >And my case is *already* assuming that the system is seriously stressed
> >and struggling ...
>
> >>When we design databases we can decouple logical planning from
> >>performance considerations, which, you must agree, are two separate
> >>issues.
>
> >>>And you know it's been proven that Huffman coding is the most
> >>>efficient compression algorithm? (Actually, it isn't - it's been
> >>>proven it can't be improved upon, which isn't the same thing ...).
> >>>Can you improve on the formula I've just given you?

Yes, by a factor of eight, as already demonstrated.

> >>>Given that if we could change the 1.05 to 1 then we can prove it
> >>>can't be improved upon ...

Okay, let him prove it in spite of the factor of 8 improvement we already
achieved.

[remainder of Wol's unthinking, blind horseshit snipped]
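(To put rough numbers on the formula being argued over - a sketch only; the
inputs N = 100 invoices and ST = 10 ms are invented for illustration, since
the thread never fixes either value:

    -- Wol's MV estimate: T = (2 + N) * ST * 1.05
    SELECT (2 + 100) * 0.010 * 1.05 AS t_mv_seconds;           -- ~1.07 s

    -- The claimed 12.5% as many head movements (a factor of eight):
    SELECT (2 + 100) * 0.010 * 1.05 * 0.125 AS t_sql_seconds;  -- ~0.13 s

The dispute is over that factor of eight, not over the shape of the
formula.)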