Thread: Optimizer not using index on 120M row table
As background: this is the main table for http://stats.distributed.net

            Table "public.email_contrib"
   Column   |     Type      | Modifiers
------------+---------------+-----------
 id         | integer       | not null
 team_id    | integer       | not null
 date       | date          | not null
 project_id | smallint      | not null
 work_units | numeric(20,0) | not null
Indexes: email_contrib_pkey primary key btree (project_id, id, date)

id is the id of a participant, team_id is the team they were on for that
day, date is the day the work was done, project_id is the project, and
work_units is the amount of work done.

explain select * from email_contrib where project_id=8 and id=39622 and
date='3/1/03';
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on email_contrib  (cost=0.00..2942185.40 rows=1 width=25)
   Filter: ((project_id = 8) AND (id = 39622) AND (date = '2003-03-01'::date))
(2 rows)

Is there any reason why this shouldn't be using the index? The
selectivity on project_id is very low (only 5 values for all 120M rows).

select attname, avg_width, n_distinct, correlation from pg_stats where
tablename='email_contrib';
  attname   | avg_width | n_distinct | correlation
------------+-----------+------------+-------------
 id         |         4 |      95184 |    0.496598
 team_id    |         4 |       1361 |    0.219478
 date       |         4 |       1769 |    0.329469
 project_id |         2 |          5 |           1
 work_units |        11 |       2100 |   0.0900541

It seems that not only should the query I explained be using the index,
but it seems that something like

select sum(work_units) .. where project_id=8 and id=39622

should also use the index. I've run vacuum analyze to no effect.

On a related note, will pgsql do 'index covering', reading only the
index if it contains all the information a query needs? IE: in Sybase,
this query will only hit the index on Email_Contrib:

select date from email_contrib where project_id=8 and id=39622;

because email_contrib_pkey contains all required values.
-- 
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Mon, 2003-04-07 at 23:29, Jim C. Nasby wrote:
>             Table "public.email_contrib"
>    Column   |     Type      | Modifiers
> ------------+---------------+-----------
>  id         | integer       | not null
>  team_id    | integer       | not null
>  date       | date          | not null
>  project_id | smallint      | not null
>  work_units | numeric(20,0) | not null
> Indexes: email_contrib_pkey primary key btree (project_id, id, date)

> explain select * from email_contrib where project_id=8 and id=39622 and
> date='3/1/03';

Since project_id is an int2, you need to add single quotes to the
integer literal, or cast it to int2 explicitly:

select * from email_contrib where project_id='8' ...

or

select * from email_contrib where project_id=8::int2 ...

BTW, this is a (well) known bug -- search the archives for more
information.

> On a related note, will pgsql do 'index covering', reading only the
> index if it contains all the information a query needs?

No -- in PostgreSQL, tuple visibility information is only stored in the
heap, not in the index. So the heap tuple always needs to be examined,
to determine if the specified tuple has been updated/deleted by some
transaction.

Cheers,

Neil
"Jim C. Nasby" <jim@nasby.net> writes: > project_id | smallint | not null > explain select * from email_contrib where project_id=8 and id=39622 and > date='3/1/03'; Cast the constants to smallint, eg project_id = 8::smallint and ... or quote them if that feels cleaner to you: project_id = '8' and ... or just declare project_id to be integer (hint: the space savings are completely illusory in this example, anyway, because of alignment requirements). See the archives for much prior discussion ;-) regards, tom lane
On Mon, Apr 07, 2003 at 11:51:17PM -0400, Tom Lane wrote:
> or just declare project_id to be integer (hint: the space savings are
> completely illusory in this example, anyway, because of alignment
> requirements).

Hrm... where can I find more info about that?
-- 
Jim C. Nasby (aka Decibel!)
"Jim C. Nasby" <jim@nasby.net> writes: > On Mon, Apr 07, 2003 at 11:51:17PM -0400, Tom Lane wrote: >> or just declare project_id to be integer (hint: the space savings are >> completely illusory in this example, anyway, because of alignment >> requirements). > Hrm... where can I find more info about that? The authoritative reference is pg_type.typalign ... see http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/catalog-pg-type.html In this example, your next field requires integer alignment, so the two bytes "saved" by using smallint disappear into the padding. regards, tom lane
On Mon, Apr 07, 2003 at 11:49:27PM -0400, Neil Conway wrote:
> > On a related note, will pgsql do 'index covering', reading only the
> > index if it contains all the information a query needs?
>
> No -- in PostgreSQL, tuple visibility information is only stored in the
> heap, not in the index. So the heap tuple always needs to be examined,
> to determine if the specified tuple has been updated/deleted by some
> transaction.

'visibility information'? Is this related to the versioning that pgsql
uses instead of read locks?

I know in other RDBMS's it's very useful to have index covering because
it eliminates needing to load the base pages into memory at all, but I
don't know pgsql's layout well enough to know how much this helps.
Chapter 7 of the developers guide has info on page files, but I haven't
seen info on how index tuples are stored.
-- 
Jim C. Nasby (aka Decibel!)
On Tue, 2003-04-08 at 00:18, Jim C. Nasby wrote:
> 'visibility information'? Is this related to the versioning that pgsql
> uses instead of read locks?

Yes -- it's the information used by PostgreSQL to implement MVCC.

> I know in other RDBMS's it's very useful to
> have index covering because it eliminates needing to load the base pages
> into memory at all

Storing visibility information in index entries (in addition to heap
tuples) doesn't strike me as a good idea: for one thing, an
UPDATE/DELETE would require touching both heap tuples and any index
entries that point to them. It would also bloat the size of indexes.

Cheers,

Neil
On Tue, Apr 08, 2003 at 12:02:55AM -0400, Tom Lane wrote:
> The authoritative reference is pg_type.typalign ... see
> http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/catalog-pg-type.html
> In this example, your next field requires integer alignment, so the
> two bytes "saved" by using smallint disappear into the padding.

Thanks, makes sense to me now. Next question... will pgsql
intelligently order things in an item record so that they're stored
most efficiently? ie: group all the byte-aligned stuff together, all
the 2-byte aligned stuff, etc.? Or will it just go off of the order
defined in the create table statement? If it's the latter, is there any
way to re-organize things without rebuilding the table from scratch?

What other things should be considered for column ordering?

Thanks!
-- 
Jim C. Nasby (aka Decibel!)
On Tue, Apr 08, 2003 at 12:29:00AM -0400, Neil Conway wrote:
> > I know in other RDBMS's it's very useful to
> > have index covering because it eliminates needing to load the base pages
> > into memory at all
>
> Storing visibility information in index entries (in addition to heap
> tuples) doesn't strike me as a good idea: for one thing, an
> UPDATE/DELETE would require touching both heap tuples and any index
> entries that point to them. It would also bloat the size of indexes.

True, though these could require touching the indexes anyway due to the
data changes (or at least I assume so, given how other RDBMS's work). It
might be a useful option to allow; in certain situations this could
provide fairly substantial gains, depending on how fat the index is
compared to the raw data rows. It might also be worth allowing for a
dirty read option if you don't care about transactions that are in
progress.
-- 
Jim C. Nasby (aka Decibel!)
"Jim C. Nasby" <jim@nasby.net> writes: > ... Next question... will pgsql intelligently > order things in an item record so that they're stored most efficiently? No, it stores 'em in the order they're declared in. There has been some speculation in the past about trying to be smarter, but I think that line of thought is pretty much dead in the face of ALTER TABLE ADD/DROP COLUMN. > What other things should be considered for column ordering? There is some execution-speed advantage to putting fixed-width never-null columns first, then fixed-width nullable columns (in order of increasing probability of being null, if you want to get really tense), finally the variable-width columns (nullability no longer matters). Note that CHAR(n) is considered variable-width, likewise BIT(n) and NUMERIC(n), even though one might think they could be considered fixed width. regards, tom lane
On Tue, 2003-04-08 at 00:35, Jim C. Nasby wrote:
> On Tue, Apr 08, 2003 at 12:29:00AM -0400, Neil Conway wrote:
> > Storing visibility information in index entries (in addition to heap
> > tuples) doesn't strike me as a good idea: for one thing, an
> > UPDATE/DELETE would require touching both heap tuples and any index
> > entries that point to them. It would also bloat the size of indexes.
>
> True, though these could require touching the indexes anyway due to the
> data changes (or at least I assume so, given how other RDBMS's work).

Not in the case of DELETE (since you don't need to add a new index
entry, and the logic to determine whether a tuple is deleted is based
upon the visibility information stored in the heap). In the case of
UPDATE, chances are that the index entry you need to add for the new
version of the tuple isn't on the same page as the old one -- so the
fact that the two pages happen to belong to the same index doesn't
really help.

> It might be a useful option to allow

Indeed, I can see how some might find it useful. Perhaps you'd like to
implement it? :-)

Cheers,

Neil
On Tue, Apr 08, 2003 at 02:04:30AM -0400, Neil Conway wrote:
> On Tue, 2003-04-08 at 00:35, Jim C. Nasby wrote:
> > True, though these could require touching the indexes anyway due to the
> > data changes (or at least I assume so, given how other RDBMS's work).
>
> Not in the case of DELETE (since you don't need to add a new index
> entry, and the logic to determine whether a tuple is deleted is based
> upon the visibility information stored in the heap). In the case of
> UPDATE, chances are that the index entry you need to add for the new
> version of the tuple isn't on the same page as the old one -- so the
> fact that the two pages happen to belong to the same index doesn't
> really help.
>
> > It might be a useful option to allow
>
> Indeed, I can see how some might find it useful. Perhaps you'd like to
> implement it? :-)

Sure, if it can be done strictly in SQL. :)
-- 
Jim C. Nasby (aka Decibel!)
There are a few settings that you pretty much have to change on bigger
hardware to get good performance.

shared_buffers (1000 to 10000 is a good place to start) -- measured in
8k blocks.

effective_cache_size (size of OS file system and disk caches, measured
in 8k blocks)

The CPU cost settings: these two measure the cost of each tuple
operation (index or table respectively) as a fraction of a sequential
scan:

cpu_index_tuple_cost
cpu_tuple_cost

This one measures the cost of each operator in a where clause, again as
a fraction of a sequential scan:

cpu_operator_cost

Lastly, the big one: random_page_cost tells the planner how much a
random page fetch costs compared to a sequential page fetch. A setting
of one would mean that a seq scan and an index scan are even.

Here are the settings off of my box, which runs a dual 10krpm UWSCSI
disk set on a dual PIII-750 with 1.5 gig ram:

shared_buffers = 32768          # default 64
effective_cache_size = 100000   # default 1000
random_page_cost = 1            # default 4
cpu_tuple_cost = 0.01           # default 0.01
cpu_index_tuple_cost = 0.0001   # default 0.001
cpu_operator_cost = 0.0025      # default 0.0025

32768*8192 = 256M
100000*8192 = ~780M

Note that we originally set random_page_cost to 1 long before I'd
realized we had our effective cache size set way too low (i.e. the
default), and so the planner was picking seq scans because it was sure
the data weren't in memory at the time. Setting effective cache size to
the right value means we could probably go back to a setting of 1.5 to
2.

It seems that when postgresql picks an index scan when it should have
picked a seq scan, the cost is that you're up to twice as slow as a seq
scan, but picking a seq scan when it shoulda picked an index can result
in performance 10 times slower, so we fall on the side of caution and
favor index scans over seq scans because of this.
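Since both shared_buffers and effective_cache_size are counted in 8 KB blocks, converting between blocks and megabytes is simple arithmetic. A quick sanity check of the figures above:

```python
BLOCK = 8192  # PostgreSQL's default block size in bytes

def blocks_to_mb(blocks):
    """Convert a count of 8 KB blocks to megabytes."""
    return blocks * BLOCK / (1024 * 1024)

shared_mb = blocks_to_mb(32768)   # shared_buffers  -> 256 MB
cache_mb = blocks_to_mb(100000)   # effective_cache_size -> ~781 MB
```

This matches the "256M" and "~780M" figures quoted in the settings (the cache figure is 781.25 MB before rounding).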
On Tue, Apr 29, 2003 at 09:46:20AM -0600, scott.marlowe wrote:
> There are a few settings that you pretty much have to change on bigger
> hardware to get good performance.
>
> shared_buffers (1000 to 10000 is a good place to start.) measured in 8k
> blocks.
> effective_cache_size (size of OS file system and disk caches measured in
> 8k blocks)

Should effective_cache_size include the size of shared_buffers? Also,
FreeBSD doesn't seem to want to use more than about 300M for disk
caching, so I currently have shared_buffers set to 90000 or about 700M
(the box has 2G, but pgsql currently has to share with Sybase). Are
there any issues with setting shared_buffers so high? Anyone know how
to make FBSD do more caching on its own, or should I just let pgsql
handle it?
-- 
Jim C. Nasby (aka Decibel!)
"Jim C. Nasby" <jim@nasby.net> writes: > Should effective_cache_size include the size of shared_buffers? Yes ... although IMHO, if shared_buffers is large enough to materially affect that number, it's too large ;-) > FreeBSD doesn't seem to want to use more than about 300M for disk > caching, so I currently have shared_buffers set to 90000 or about 700M > (the box has 2G, but pgsql currently has to share with Sybase). Are > there any issues with setting shared_buffers so high? Plenty, see many past threads in pgsql-performance and other lists. There are strong reasons to think that you should let the kernel do the bulk of the caching work. regards, tom lane
On Wed, Apr 30, 2003 at 12:14:34AM -0400, Tom Lane wrote:
> Plenty, see many past threads in pgsql-performance and other lists.
> There are strong reasons to think that you should let the kernel do the
> bulk of the caching work.

OK, anyone have any ideas on how to get FreeBSD to cache more than
300M? :)

BTW, all the searches I did on this topic and FreeBSD resulted in
results saying I should set shared_buffers way high, so it seems FBSD
users are having pgsql do the caching instead of the OS.
-- 
Jim C. Nasby (aka Decibel!)
On Wednesday 30 April 2003 11:26, Jim C. Nasby wrote:
> On Wed, Apr 30, 2003 at 12:14:34AM -0400, Tom Lane wrote:
> > Plenty, see many past threads in pgsql-performance and other lists.
> > There are strong reasons to think that you should let the kernel do the
> > bulk of the caching work.
>
> OK, anyone have any ideas on how to get FreeBSD to cache more than 300M?

Some sysctl tweaks? Haven't booted into BSD for months, so not exactly
helpful, but still..

> BTW, all the searches I did on this topic and FreeBSD resulted in
> results saying I should set shared_buffers way high, so it seems FBSD
> users are having pgsql do the caching instead of the OS.

Well, that is the postgresql philosophy on any platform. Further, about
setting the shared buffers: ideally you could watch ipcs output for
shared memory usage and set the shared buffers to between 120%-150% of
average shared memory usage. One of the databases I am playing with now
has 80M rows and is 6GB big, but shared memory usage for random index
scan queries does not seem to grow beyond 10MB. Of course there are not
many queries fired on it, but this is just to give you an idea..

HTH

Shridhar
On Wed, Apr 30, 2003 at 12:56:11AM -0500, Jim C. Nasby wrote:
> On Wed, Apr 30, 2003 at 12:14:34AM -0400, Tom Lane wrote:
> > Plenty, see many past threads in pgsql-performance and other lists.
> > There are strong reasons to think that you should let the kernel do the
> > bulk of the caching work.
>
> OK, anyone have any ideas on how to get FreeBSD to cache more than 300M?
> :)

Ok, looks like I was just mis-reading top / not understanding the FBSD
vm subsystem well enough.

I'll try and dig up stuff in the archives about shared_buffer settings,
but it seems like it would be really useful to have a whitepaper
describing why the OS should be left to do caching, what pgsql actually
tries to use its shared buffers for (beyond simple read caching), and
how best to size shared_buffers.
-- 
Jim C. Nasby (aka Decibel!)
On Tue, 29 Apr 2003, Jim C. Nasby wrote:
> On Tue, Apr 29, 2003 at 09:46:20AM -0600, scott.marlowe wrote:
> > There are a few settings that you pretty much have to change on bigger
> > hardware to get good performance.
> >
> > shared_buffers (1000 to 10000 is a good place to start.) measured in 8k
> > blocks.
> > effective_cache_size (size of OS file system and disk caches measured in
> > 8k blocks)
>
> Should effective_cache_size include the size of shared_buffers? Also,
> FreeBSD doesn't seem to want to use more than about 300M for disk
> caching, so I currently have shared_buffers set to 90000 or about 700M
> (the box has 2G, but pgsql currently has to share with Sybase). Are
> there any issues with setting shared_buffers so high? Anyone know how to
> make FBSD do more caching on it's own, or should I just let pgsql handle
> it?

No, it doesn't. From reading the docs it would appear that it's the
estimate of how much memory the kernel is using to cache disk access
for postgresql alone. I.e. if you're running an LDAP server that uses
32 Megs, an apache server using another 32 Megs or so, and postgresql
set for shared buffers of 32 Megs, then on a 256 Meg machine that shows
128 Megs used for cache, it's likely that only a portion is postgresql
cache, around 96 Megs if all three apps do the same amount of disk
access.

There may be some tuning parameters for BSD that will let it use more
disk cache, or it could be that there just isn't any more to cache.

There's no problem with cranking up shared_buffers, but you should
always test it and compare it to a few smaller settings and find the
"knee" where Postgresql stops getting faster. On my box that was around
128 Meg of ram, with a slight gain to 256 Meg, so, since I had 900 Megs
as disk cache at the time, I set it to 256 Meg. For my machine and my
load, that's lots and lots of shared_buffer cache.
There are questions about performance loss as the shared_buffer setting
goes up, due to the nature of postgresql's buffering method, which is
based on shared memory. It would appear that most kernels are great at
buffering huge amounts of data in a dirt simple way, while the shared
memory that postgresql uses may start off faster (say at a few thousand
buffers) but seems to reach a point of diminishing returns in real
tests. For me that was somewhere around 512 Meg of shared buffer
memory. Note that it's not like Postgresql ground to a halt and stopped
working, it just stopped going faster, and started getting a little
slower. It might still be a win for certain types of loads, just not
mine.

The load my database runs is about 90% small lookups (select * from
view where id=123), 5% medium reports (select * from view where date>1
month ago), and about 5% gargantuan reports (select * from
sales_history where year like '199%'). The problem for me is that only
the small lookup type apps are changing a lot of data, and refetching
all the time. The medium reports may get run a dozen times in one
sitting, but most of the time it's a one-shot. The gargantuan reports
are run by batch files mostly, or by single users who know it's ok that
it takes a minute or two to get their report and they won't hit reload
six or seven times. :-)

The thing that slowed down the most at high buffer settings for me was
the small access applications, i.e. work flow, order processing type
stuff. It didn't benefit from a larger cache because it's working over
a small dataset, so any performance loss due to overhead was seen here
first, since there was no corresponding advantage to a large cache on a
small dataset. The big reports were actually almost the same speed,
sometimes still faster at 512Meg than 256Meg, but that makes sense,
since select * from logs order by random() is gonna take up some space
:-) especially if you're joining it to get a username -> realname
lookup.
But all those big things are faster if you can run them on a cron and
have them ready in the morning. And the machine doesn't care if it's
20% faster at 2:00am.

But shared_buffers isn't the dangerous setting; that's sort_mem cranked
up too high. Since sort_mem comes from plain old memory, not the
shared_buffers, a bunch of order by statements running in parallel can
make the kernel flush its cache, then start swapping out programs if
things get real bad. Causing the kernel to flush cache means that you
can get into a round robin cache problem where everything the database
asks for was just here a minute ago, but we just flushed it to make
room for this last block here. And that happens for each block you're
looking for. Bad as that is, it's nothing compared to making your
machine start swapping, especially if what it's swapping are shared
buffer blocks... which many OSes (Linux is one, not sure about the
BSDs) will happily do. After that the danger in Linux is that you'll
use up all the swap, all the mem, and the kernel will start kill -9ing
"rogue" processes. Unfortunately, the "rogue process recognition
algorithm" in Linux is: kill that which is big and hungry. I.e.
Postgresql backends.

So, I think the process of setting those settings for a machine is by
its nature an interactive thing; you have to build it and then put it
under load, measure its performance, and tweak it a little at a time.
The real danger is in overshooting the optimum. The first sign is that
things aren't getting faster, and the next sign might be that the
machine is crawling in a swap storm. Mediocre but reliable is
preferable to meteoric (both in terms of speed and reliability.)

What I shoot for now is to have half of the memory in a box be kernel
cache, 1/4th or so be postgresql programs and cache, and the other
1/4th to run anything else on the box.
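The reason sort_mem is the dangerous knob is that it's allocated per sort, per backend, out of regular (non-shared) memory. A rough worst-case estimate, under the simplifying assumption of one active sort per backend (a single complex query can actually run several sorts at once, so this is a lower bound):

```python
def worst_case_sort_mb(sort_mem_kb, max_connections, sorts_per_backend=1):
    """Estimate peak non-shared memory (MB) consumed by sorts.

    sort_mem is a per-sort, per-backend limit in KB, so the worst case
    scales with how many backends can sort concurrently.
    """
    return sort_mem_kb * max_connections * sorts_per_backend / 1024

# Hypothetical numbers: sort_mem = 8192 (8 MB) with 100 backends
# each running one sort simultaneously:
peak = worst_case_sort_mb(8192, 100)  # 800 MB outside shared_buffers
```

An 8 MB sort_mem sounds harmless until 100 backends all hit an ORDER BY at once and claim 800 MB, which is exactly the cache-flush-then-swap scenario described above.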
On my current machine that's 1.5 Gig mem, 700-800M used for kernel disk cache, 256 Meg used for postgresql, and about 400-500M used for everything else. Postgresql performance is really not an issue unless we write a poorly designed query. And if we do, we can't usually use up enough memory to cause problems.