on 4/14/00 11:08 PM, Thomas at englim@pc.jaring.my wrote:
> Stephen J Lombardo wrote:
>
>> ... Consider a table where you have some round number of
>> tuples, say 100,000. Suppose you had b-tree indexes on two attributes,
>> employee_id (primary key) and last_name. Now if you were to run a query to
>> look up an employee by the primary key you would surely want to use the
>> index. Assume that it would take 3 disk accesses to search the index, and
>> one to fetch the data page from the heap. So you have a total of 4 disk
>> accesses to search on the primary key and retrieve one row. Now suppose you
>> were going to run a query that would return a significant number of rows,
>> let's say half the table (50,000). Now if the optimizer chose to use the
>> index on that query it would take 4 disk accesses to locate each and every
>> row (3 to search the index, 1 to grab the data page). So if the query ran
>> using the index it would use 200,000 (50,000 * 4) disk accesses (Worst case
>> scenario of course. Using CLUSTER could improve the efficiency). Let's
>> assume that the average size of a tuple is 500 bytes. So PostgreSQL would
>> pack about 16 tuples into a single 8K page. Therefore doing a sequential
>> scan on the table would require 100,000/16, or 6,250 disk accesses.
>
> I don't exactly know the inner workings of the btree used in PostgreSQL,
> but is the calculation accurate? I would think the query would initially
> use the index (if it decided to use it) to do the lookup, then traverse the
> index using get_next, which means it doesn't take (50,000 * 4) disk
> accesses. Traversing the index to return all the 50,000 records in a
> 100,000 record database should theoretically be less expensive than
> sequentially scanning the whole database.
That would be assuming that all of the matching keys sat in adjacent leaves
of the b-tree, as they would in a query using inequalities (a range scan).
Granted, this is the most likely scenario, but others are possible, hence my
statement "Worst case scenario of course." i.e. if the query had the effect
of selecting rows whose keys were uniformly distributed in the index.
But even if all of the keys were under one parent node in the b-tree, an
index scan could still be less efficient than a sequential scan. Consider a
case where the actual tuples are uniformly distributed in the heap. As the
index was traversed, the probability of the next row being located on a
different page would be very high. So chances are good that PostgreSQL would
have to fetch a page for each and every row.
So assume it would take 3 disk accesses (DAs) to initially search the
b-tree, 1 DA for each subsequent index lookup, and 1 DA for each row. You
still end up with 3 + (2 * 50,000) DAs. Of course, my second comment above
was "Using CLUSTER could improve the efficiency". If you were to cluster the
heap according to the b-tree, then an index scan would be at least as
efficient as a sequential scan, because the heap itself would be organized
so that the probability of the next fetched row requiring a new disk access
is low.
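To put the two estimates side by side, the back-of-the-envelope arithmetic
works out roughly like this (same assumptions as above: 3 DAs to descend the
index, about 2 DAs per row for an unclustered index scan, and about 16
tuples per page):

    -- Rough disk-access estimates for returning 50,000 of 100,000 rows.
    SELECT 3 + 2 * 50000 AS index_scan_das,  -- unclustered index scan: 100,003
           100000 / 16   AS seq_scan_das;    -- sequential scan: 6,250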
>> f "
>> SELECT state FROM table_name WHERE state='NY';
>>
>> The optimizer will see if it has any statistics on this table. If not it
>> will make a guess at how many rows will be returned. So the optimizer
>> guesses that 1% of the table, or 10,000 rows, will be returned. Then it
>> will use that number to assess how to run the query. Now if it had
>> statistics on the table the optimizer would know that there were only 5
>> different values in the state column of the table. So the optimizer would
>> assume that 20% of the table would be returned from the query. It is
>> likely that the optimizer will choose a very different plan when it thinks
>> that 200,000 rows will be returned.
>
> Likewise, if there's an index on state, use the index for lookup and
> traverse the index without even being bothered with the number of records
> the query is going to return (if the index is updated correctly in
> updates/inserts).
> Please remember that other databases do not have to "vacuum" to be
> optimized in their query plans.
As per the example above, using an index to fetch 200,000 rows would be
extremely expensive, but using it to fetch 10,000 might be reasonable.
PostgreSQL does not require that you have a vacuum-analyzed database to do
optimization, but it will do BETTER optimization if it has statistics to
work with.
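For instance, using the table from the example above, one way to see the
effect of statistics is to compare the plan the optimizer picks before and
after they are collected (the plans you actually get will of course depend
on the data):

    -- Ask for the plan with whatever statistics exist (possibly none).
    EXPLAIN SELECT state FROM table_name WHERE state = 'NY';

    -- Collect statistics, then ask again.
    VACUUM ANALYZE table_name;
    EXPLAIN SELECT state FROM table_name WHERE state = 'NY';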
Remember that vacuum is necessary because of PostgreSQL's use of
multiversion concurrency control. MVCC requires that multiple copies of
tuples (or at least attributes) be maintained, because different
transactions may have different views of the data at any given time. Vacuum
eliminates old copies of the data and compacts the heap. Every DBMS that I
am aware of that uses MVCC has some similar requirement to vacuum, because
the DBMS does not inherently know when it is ok to purge and compact.
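For example, on the table from the query above, a plain vacuum just reclaims
the space left behind by old tuple versions; it says nothing about
statistics:

    -- Updates leave dead tuple versions behind under MVCC;
    -- a plain VACUUM purges them and compacts the heap.
    UPDATE table_name SET state = 'NJ' WHERE state = 'NY';
    VACUUM table_name;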
Vacuum analyze, on the other hand, collects statistics about the data.
AFAIK there are still commercial relational databases out there that require
you to collect statistics in a separate process (they are not collected
dynamically), Ingres for example. In theory the use of MVCC complicates this
matter, because if PostgreSQL were to collect stats dynamically it would need
to know which copy of the row to use. Generating stats during a vacuum
ensures that there is only one set of data for each tuple, thus simplifying
the matter greatly.
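As a quick illustration, the row and page counts the planner works from can
be seen in the pg_class system catalog once they have been collected:

    -- Collect statistics in the same pass that vacuums the table.
    VACUUM ANALYZE table_name;

    -- reltuples and relpages feed the planner's cost estimates.
    SELECT relname, reltuples, relpages
    FROM pg_class
    WHERE relname = 'table_name';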
Disclaimer: Calculations may be inaccurate. When talking about optimization
you are delving deep into the area of cost approximation theory, which is
mostly educated guesswork. Efficiency depends largely on the implementation
of the DBMS in question. The cost calculations are therefore only a way to
draw general comparisons between the ESTIMATED efficiency of queries.
Cheers,
Stephen