Re: extremly low memory usage - Mailing list pgsql-performance

From: John A Meinel
Subject: Re: extremly low memory usage
Date:
Msg-id: 43061472.2080201@arbash-meinel.com
In response to: Re: extremly low memory usage (Jeremiah Jahn <jeremiah@cs.earlham.edu>)
Responses: Re: extremly low memory usage
           Re: extremly low memory usage
List: pgsql-performance
Jeremiah Jahn wrote:
> Sorry about the formatting.
>
> On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
>
>>Jeremiah Jahn wrote:
>>
>>...
>>
>>The expensive parts are the 4915 lookups into litigant_details (each
>>one takes approx 4ms, for a total of ~20s). And then you do it again
>>on case_data (average 3ms each * 4906 loops = ~15s).
>
> Is there some way to avoid this?

Well, in general, 3ms for a single lookup seems really long. Maybe your
index is bloated from not vacuuming often enough. Do you tend to get a
lot of updates to litigant_details?

There are a couple of possibilities at this point. First, you can
REINDEX the appropriate index and see if that helps. However, if this
is a test box, it sounds like you just did a dump and reload, which
would not leave bloat in an index.

Another possibility: is this the column you usually use when pulling
information out of litigant_details? If so, you can CLUSTER
litigant_details on the appropriate index (a concrete sketch is further
down in this mail). That keeps rows which belong together physically
close, which decreases the index lookup costs. However, if this is not
the common column, you will probably slow down whatever other accesses
you have on this table.

After CLUSTER, the current data stays clustered, but new data will not,
so you have to re-run CLUSTER periodically, the same way you might
VACUUM. *However*, IIRC CLUSTER grabs an exclusive lock, so it is as
expensive as a VACUUM FULL. Be aware of this; it might still vastly
improve your performance, so it could be worth it.

>>So there is no need for preloading your indexes on the identity table.
>>It is definitely not the bottleneck.
>>
>>So a few design bits, which may help your database.
>>Why is "actor_id" a text field instead of a number?
>
> This is simply due to the nature of the data.

I'm just wondering if changing it to a number, and using a number->name
lookup, would be faster for you. It may not be. In general, I prefer to
use numbers for references. I may be overly paranoid, but I know that
some locales are slow at string -> string comparisons. And since the
data in your database is stored as UNICODE, I'm not sure whether it has
to do any translating or not. Again, something to consider; it may not
make any difference.

>>You could try creating an index on "litigant_details (actor_id,
>>count_ori)" so that it can do just an index lookup, rather than an
>>index + filter.
>
> I have one, but it doesn't seem to like to use it. I don't really need
> it, though: I can just drop the court_id out of the query, since it's
> redundant; each actor_id is also unique in litigant_details. I had run
> VACUUM FULL and ANALYZE, but I ran them again anyway and the planning
> improved. However, my 14-disk RAID 10 array is still slower than the
> 3-disk RAID 5 on my production box: 46 sec vs. 30 sec (with live
> traffic on production). One of the strange things is that when I run
> the cat command on my "hot" indexes and tables, it has no effect on
> memory usage. Right now I'm running ext3 on LVM. I'm still in a
> position to redo the file system and everything. Is this a good way to
> do it, or should I switch to something else? What about stripe and
> extent sizes...? Kernel parameters to change?
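Before I get to the plans: to make the REINDEX/CLUSTER suggestion above
concrete, it would look something like the following. I'm borrowing
lit_actor_speed from your plan output as the index name; whether that
is really the right index to cluster on depends on your common access
path, so treat this as a sketch:

    REINDEX INDEX lit_actor_speed;

    -- or, if actor_id lookups really are the common path:
    CLUSTER lit_actor_speed ON litigant_details;
    ANALYZE litigant_details;

Keep in mind that CLUSTER holds an exclusive lock on the table while it
rewrites it, so do it during a quiet period.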
Well, the plans are virtually identical. There is one small difference
as to whether it joins against case_data or court first, but 'court' is
very tiny (small enough to use a seqscan instead of an index scan). I'm
a little surprised, court being that small, that it doesn't do
something like a hash aggregation, but court takes no time anyway.

The real problem is that your nested-loop index time is *much* slower.

Devel:
  -> Index Scan using lit_actor_speed on litigant_details
     (cost=0.00..3.96 rows=1 width=81)
     (actual time=4.788..4.812 rows=1 loops=5057)

Production:
  -> Index Scan using lit_actor_speed on litigant_details
     (cost=0.00..5.63 rows=1 width=81)
     (actual time=3.355..3.364 rows=1 loops=5057)

Devel:
  -> Index Scan using case_speed on case_data
     (cost=0.00..3.46 rows=1 width=26)
     (actual time=4.222..4.230 rows=1 loops=5052)

Production:
  -> Index Scan using case_data_pkey on case_data
     (cost=0.00..5.31 rows=1 width=26)
     (actual time=1.897..1.904 rows=1 loops=5052)

Notice that on production the actual per-row time is as little as half
of what your devel box takes.

As a test, can you run "time cat $index_file >/dev/null" a couple of
times and work out the MB/s? Alternatively, run vmstat in another
shell: if the blocks read per second don't change, you know the "cat"
is being served from RAM, and the file really is cached.

I can point you to REINDEX and CLUSTER, but if the data is cached in
RAM, I honestly can't say why the per-loop time would be that much
slower.

Are both systems running the same postgres version? It sounds like they
are different (since you say something about switching to 8.0). I doubt
that's the cause, but you might try an 8.1devel version.

...

>>Do you regularly vacuum analyze your tables?
>>Just as a test, try running:
>>set enable_nestloop to off;
>
> not quite acceptable
> Total runtime: 221486.149 ms

Well, the estimates are now at least closer (3k vs 5k instead of 1),
and it is still choosing nested loops, so they probably are faster. I
would still be interested in the actual EXPLAIN ANALYZE with nested
loops disabled; it is possible that *some* of the nested loops are
performing worse than they have to. But really, you have worse index
speed, and that needs to be figured out.
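Spelled out, the test I'd want to see is just this, with your real
query in place of the placeholder:

    SET enable_nestloop TO off;
    EXPLAIN ANALYZE
    SELECT ...;            -- your full query goes here
    RESET enable_nestloop;

The setting only affects your current session, so nothing else running
against the box is touched.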
John
=:->