Thread: extremely low memory usage
I just put together a system with 6GB of ram on a 14 disk raid 10 array.
When I run my usual big painful queries, I get very little to no memory
usage. My production box (raid 5, 4GB ram) hovers at 3.9GB used most of
the time. The new devel box sits at around 250MB.

I've switched to an 8.0 system on the new devel box, but the .conf really
didn't change. Index usage is the same. Something seems wrong and I'm not
sure why.

any thoughts,
-jj-

shared_buffers = 32768          # min 16, at least max_connections*2, 8KB each
work_mem = 2097151              # min 64, size in KB
maintenance_work_mem = 819200   # min 1024, size in KB
max_fsm_pages = 80000           # min max_fsm_relations*16, 6 bytes each
checkpoint_segments = 30        # in logfile segments, min 1, 16MB each
effective_cache_size = 3600000  <-----this is a little out of control, but would it have any real effect?
random_page_cost = 2            # units are one sequential page fetch cost
log_min_duration_statement = 10000  # -1 is disabled, in milliseconds.
lc_messages = 'C'               # locale for system error message strings
lc_monetary = 'C'               # locale for monetary formatting
lc_numeric = 'C'                # locale for number formatting
lc_time = 'C'                   # locale for time formatting

--
"Now this is a totally brain damaged algorithm. Gag me with a smurfette."
                    -- P. Buhr, Computer Science 354
Jeremiah Jahn wrote:
> I just put together a system with 6GB of ram on a 14 disk raid 10 array.
> When I run my usual big painful queries, I get very little to know
> memory usage. My production box (raid 5 4GB ram) hovers at 3.9GB used
> most of the time. the new devel box sits at around 250MB.
>
> I've switched to an 8.0 system on the new devel box, but the .conf
> really didn't change. Index usage is the same. Something seems wrong and
> I'm not sure why.
>

How big is your actual database on disk? And how much of it is actually
touched by your queries?

It seems that your tough queries might only be exercising a portion of
the database. If you really want to make memory usage increase try
something like:
find . -type f -print0 | xargs -0 cat >/dev/null
Which should read all the files. After doing that, does the memory usage
increase?

> any thoughts,
> -jj-
>
> shared_buffers = 32768          # min 16, at least max_connections*2, 8KB each
> work_mem = 2097151              # min 64, size in KB

This seems awfully high. 2GB Per sort? This might actually be flushing
some of your ram, since it would get allocated and filled, and then
freed when finished. Remember, depending on what you are doing, this
amount can get allocated more than once per query.

> maintenance_work_mem = 819200   # min 1024, size in KB
> max_fsm_pages = 80000           # min max_fsm_relations*16, 6 bytes each
> checkpoint_segments = 30        # in logfile segments, min 1, 16MB each
> effective_cache_size = 3600000  <-----this is a little out of control, but would it have any real effect?

It should just tell the planner that it is more likely to have buffers
in cache, so index scans are slightly cheaper than they would otherwise be.

> random_page_cost = 2            # units are one sequential page fetch cost
> log_min_duration_statement = 10000  # -1 is disabled, in milliseconds.
> lc_messages = 'C'               # locale for system error message strings
> lc_monetary = 'C'               # locale for monetary formatting
> lc_numeric = 'C'                # locale for number formatting
> lc_time = 'C'                   # locale for time formatting

John
=:->
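A quick way to answer "does the memory usage increase?" is to compare free(1) output before and after the preload. This is only a sketch for a stock Linux box; the data directory path is an example, not necessarily where this install keeps its files:

    # Note the "cached" column before preloading.
    free -m

    # Warm the OS page cache with the database files (substitute the real
    # data directory for the example path below).
    cd /var/lib/pgsql/data/base
    find . -type f -print0 | xargs -0 cat > /dev/null

    # Check again: "cached" should have grown by roughly the amount read,
    # up to the limit of free RAM. Per-process memory for the postgres
    # backends will NOT grow, because this caching lives in the kernel,
    # not in the database.
    free -m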
On Aug 17, 2005, at 10:11 PM, Jeremiah Jahn wrote:

> I just put together a system with 6GB of ram on a 14 disk raid 10
> array.
> When I run my usual big painful queries, I get very little to know
> memory usage. My production box (raid 5 4GB ram) hovers at 3.9GB used
> most of the time. the new devel box sits at around 250MB.
>

Is the system performing fine? Are you touching as much data as the
production box?

If the system is performing fine don't worry about it.

> work_mem = 2097151              # min 64, size in KB

This is EXTREMELY high. You realize this is the amount of memory that
can be used per-sort and per-hash build in a query? You can end up with
multiples of this on a single query. If you have some big queries that
are run infrequently have them set it manually.

> effective_cache_size = 3600000  <-----this is a little out of
> control, but would it have any real effect?

This doesn't allocate anything - it is a hint to the planner about how
much data it can assume is cached.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
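To make "have them set it manually" concrete: the per-query knob can be raised just for the session or transaction that runs the big report, while postgresql.conf keeps a much smaller default. This is only a sketch; the 262144 figure (256MB, expressed in KB) is an arbitrary example, not a recommendation:

    -- postgresql.conf keeps a modest default, e.g.:
    --   work_mem = 16384              # 16MB, size in KB

    -- Raise it only around the big, infrequent query:
    BEGIN;
    SET LOCAL work_mem = 262144;       -- 256MB in KB; reverts at COMMIT/ROLLBACK
    -- ... run the big painful query here ...
    COMMIT;

    -- Or for the rest of the session:
    SET work_mem = 262144;

On 8.0 the value is a plain integer number of kilobytes; the unit suffixes ('256MB') accepted by later releases are not available.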
On Wed, 2005-08-17 at 21:21 -0500, John A Meinel wrote:
> Jeremiah Jahn wrote:
> > I just put together a system with 6GB of ram on a 14 disk raid 10 array.
> > When I run my usual big painful queries, I get very little to know
> > memory usage. My production box (raid 5 4GB ram) hovers at 3.9GB used
> > most of the time. the new devel box sits at around 250MB.
> >
> > I've switched to an 8.0 system on the new devel box, but the .conf
> > really didn't change. Index usage is the same. Something seems wrong and
> > I'm not sure why.
> >
>
> How big is your actual database on disk? And how much of it is actually
> touched by your queries?

The DB is about 60GB. About 10GB is actually used in real queries, versus
"get me this single record with this ID". I have a large query that finds
court cases based on certain criteria that is name based. I get a full
seq scan on the name table in about 7 seconds. This table has about 6
million names (most being 'smith, something'). The index scan takes much
less time of course, once it's been cached (somewhere, but not apparently
memory). The real query can take 60 seconds on a first run, and 1.3
seconds on a second run. I'm very happy with the cached results, just not
really sure where that caching is happening since it doesn't show up as
memory usage. I do know that the caching that happens seems to be
independent of the DB. I can restart the DB and my speeds are still the
same as the cached second query. Is there some way to pre-cache some of
the tables/files on the file system? If I switch my query to search for
'jones%' instead of 'smith%', I take a hit. But if I then rerun the smith
search, I still get cached speed. I only have two tables, essentially
names and events, that have to do any real work, i.e. not very atomic
data. I'd love to be able to force these two tables into a cache
somewhere. This is a linux system (RHEL ES4) by the way.

> It seems that your tough queries might only be exercising a portion of
> the database. If you really want to make memory usage increase try
> something like:
> find . -type f -print0 | xargs -0 cat >/dev/null
> Which should read all the files. After doing that, does the memory usage
> increase?
>
> > any thoughts,
> > -jj-
> >
> > shared_buffers = 32768          # min 16, at least max_connections*2, 8KB each
> > work_mem = 2097151              # min 64, size in KB
>
> This seems awfully high. 2GB Per sort? This might actually be flushing
> some of your ram, since it would get allocated and filled, and then
> freed when finished. Remember, depending on what you are doing, this
> amount can get allocated more than once per query.

What's a good way to determine the optimal size?

> > maintenance_work_mem = 819200   # min 1024, size in KB
> > max_fsm_pages = 80000           # min max_fsm_relations*16, 6 bytes each
> > checkpoint_segments = 30        # in logfile segments, min 1, 16MB each
> > effective_cache_size = 3600000  <-----this is a little out of control, but would it have any real effect?
>
> It should just tell the planner that it is more likely to have buffers
> in cache, so index scans are slightly cheaper than they would otherwise be.
>
> > random_page_cost = 2            # units are one sequential page fetch cost
> > log_min_duration_statement = 10000  # -1 is disabled, in milliseconds.
> > lc_messages = 'C' # locale for system error message strings > > lc_monetary = 'C' # locale for monetary formatting > > lc_numeric = 'C' # locale for number formatting > > lc_time = 'C' # locale for time formatting > > > > John > =:-> -- "Now this is a totally brain damaged algorithm. Gag me with a smurfette." -- P. Buhr, Computer Science 354
Jeremiah Jahn wrote:

>On Wed, 2005-08-17 at 21:21 -0500, John A Meinel wrote:
>
>>Jeremiah Jahn wrote:
>>
>>>I just put together a system with 6GB of ram on a 14 disk raid 10 array.
>>>When I run my usual big painful queries, I get very little to know
>>>memory usage. My production box (raid 5 4GB ram) hovers at 3.9GB used
>>>most of the time. the new devel box sits at around 250MB.
>>>
>>>I've switched to an 8.0 system on the new devel box, but the .conf
>>>really didn't change. Index usage is the same. Something seems wrong and
>>>I'm not sure why.
>>>
>>How big is your actual database on disk? And how much of it is actually
>>touched by your queries?
>>
>The DB is about 60GB. About 10GB is actually used in real queries,
>versus get me this single record with this ID. I have a large query that
>finds court cases based on certain criteria that is name based. I get a
>full seq scan on the name table in about 7 seconds, This table has about
>6 million names (most being 'smith, something'). The index scan takes
>much less time of course, once it's been cached (somewhere but not
>apparently memory). The really query can take 60 seconds on a first run.
>And 1.3 seconds on a second run. I'm very happy with the cached results,
>just not really sure where that caching is happening since it doesn't
>show up as memory usage. I do know that the caching that happens seems
>to be independent of the DB. I can restart the DB and my speeds are
>still the same as the cached second query. Is there some way to
>pre-cache some of the tables/files on the file system? If I switch my
>query to search for 'jones%' instead of 'smith%', I take a hit. But if I
>then rerun the smith search, I still get cached speed. I only have two
>tables essentially names and events that have to do any real work ie.
>not very atomic data. I'd love to be able to force these two tables into
>a cache somewhere. This is a linux system (RHEL ES4) by the way.
>

I think what is happening is that *some* of the index pages are being
cached, just not all of them. Most indexes (if you didn't specify
anything special) are btree, so that you load the root page, and then
determine what pages need to be loaded from there. So the "jones%" pages
aren't anywhere near the "smith%" pages. And don't need to be loaded if
you aren't accessing them.

So the required memory usage might be smaller than you think. At least
until all of the index pages have been accessed.

The reason it is DB independent is because the OS is caching a file
access (you read a file, it keeps the old pages in RAM in case you ask
for it again).

Part of the trick, is that as you use the database, it will cache what
has been used. So you may not need to do anything. It should sort itself
out with time.

However, if you have to have cached performance as soon as your machine
reboots, you could figure out what files on disk represent your indexes
and tables, and then just "cat $files >/dev/null"
That should cause a read on those files, which should pull them into the
memory cache. *However* this will fail if the size of those files is
greater than available memory, so you may want to be a little bit stingy
about what you preload.

Alternatively, you could just write an SQL script which runs a bunch of
indexed queries to make sure all the pages get loaded. Something like:

FOR curname IN SELECT DISTINCT name FROM users LOOP
  SELECT name FROM users WHERE name=curname;
END LOOP;

That should make the database go through the entire table, and load the
index for every user.
This is overkill, and will probably take a long time to execute. But you could do it if you wanted. >>It seems that your tough queries might only be exercising a portion of >>the database. If you really want to make memory usage increase try >>something like: >>find . -type f -print0 | xargs -0 cat >/dev/null >>Which should read all the files. After doing that, does the memory usage >>increase? >> >> >> >>>any thoughts, >>>-jj- >>> >>> >>>shared_buffers = 32768 # min 16, at least max_connections*2, 8KB each >>>work_mem = 2097151 # min 64, size in KB >>> >>> >>This seems awfully high. 2GB Per sort? This might actually be flushing >>some of your ram, since it would get allocated and filled, and then >>freed when finished. Remember, depending on what you are doing, this >>amount can get allocated more than once per query. >> >> >What's a good way to determine the optimal size? > > Practice. :) A few questions I guess... How many concurrent connections are you expecting? How many joins does a standard query have? How big are the joins? In general, I would tend to make this a smaller number, so that the os has more room to cache tables, rather than having big buffers for joins. If someone is requesting a join that requires a lot of rows, I would rather *that* query be slower, than impacting everyone else. I would put it more with a maximum in the 20-100MB range. John =:-> > > >>>maintenance_work_mem = 819200 # min 1024, size in KB >>>max_fsm_pages = 80000 # min max_fsm_relations*16, 6 bytes each >>>checkpoint_segments = 30 # in logfile segments, min 1, 16MB each >>>effective_cache_size = 3600000 <-----this is a little out of control, but would it have any real effect? >>> >>> >>It should just tell the planner that it is more likely to have buffers >>in cache, so index scans are slightly cheaper than they would otherwise be. >> >> >> >>>random_page_cost = 2 # units are one sequential page fetch cost >>>log_min_duration_statement = 10000 # -1 is disabled, in milliseconds. >>>lc_messages = 'C' # locale for system error message strings >>>lc_monetary = 'C' # locale for monetary formatting >>>lc_numeric = 'C' # locale for number formatting >>>lc_time = 'C' # locale for time formatting >>> >>> >>> >>John >>=:-> >> >>
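To make the "figure out what files on disk represent your indexes and tables" step concrete, here is a rough sketch. It assumes PGDATA points at the data directory and uses table/index names from this thread; relations larger than 1GB are split into extra segment files named <relfilenode>.1, <relfilenode>.2, and so on, which the wildcard picks up:

    # Resolve the database oid and the on-disk file for each hot relation,
    # then read the files to pull them into the OS page cache.
    DBOID=$(psql -At -c "SELECT oid FROM pg_database WHERE datname = current_database()" copa)
    for rel in litigant_details case_data lit_actor_speed case_speed; do
        FILENODE=$(psql -At -c "SELECT relfilenode FROM pg_class WHERE relname = '$rel'" copa)
        cat "$PGDATA/base/$DBOID/$FILENODE"* > /dev/null
    done

As John notes, this only helps if the files comfortably fit in RAM; otherwise the preload just evicts other useful pages.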
Here's an example standard query. I really have to make the first hit go
faster. The table is clustered on full_name as well. 'Smith%' took 87
seconds on the first hit. I wonder if I set up my array wrong. I remember
seeing something about DMA access versus something else, and chose DMA
access. LVM maybe?

explain analyze
select distinct case_category, identity_id, court.name, litigant_details.case_id,
       case_year, date_of_birth, assigned_case_role, litigant_details.court_ori,
       full_name, litigant_details.actor_id, case_data.type_code,
       case_data.subtype_code, litigant_details.impound_litigant_data,
       to_number(trim(leading case_data.type_code from
                 trim(leading case_data.case_year from case_data.case_id)), '999999') as seq
from identity, court, litigant_details, case_data
where identity.court_ori = litigant_details.court_ori
  and identity.case_id = litigant_details.case_id
  and identity.actor_id = litigant_details.actor_id
  and court.id = identity.court_ori
  and identity.court_ori = case_data.court_ori
  and case_data.case_id = identity.case_id
  and identity.court_ori = 'IL081025J'
  and full_name like 'MILLER%'
order by full_name;

                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Unique  (cost=20411.84..20411.91 rows=2 width=173) (actual time=38340.231..38355.120 rows=4906 loops=1)
   ->  Sort  (cost=20411.84..20411.84 rows=2 width=173) (actual time=38340.227..38343.667 rows=4906 loops=1)
         Sort Key: identity.full_name, case_data.case_category, identity.identity_id, court.name, litigant_details.case_id, case_data.case_year, identity.date_of_birth, litigant_details.assigned_case_role, litigant_details.court_ori, litigant_details.actor_id, case_data.type_code, case_data.subtype_code, litigant_details.impound_litigant_data, to_number(ltrim(ltrim((case_data.case_id)::text, (case_data.case_year)::text), (case_data.type_code)::text), '999999'::text)
         ->  Nested Loop  (cost=0.00..20411.83 rows=2 width=173) (actual time=12.891..38317.017 rows=4906 loops=1)
               ->  Nested Loop  (cost=0.00..20406.48 rows=1 width=159) (actual time=12.826..23232.106 rows=4906 loops=1)
                     ->  Nested Loop  (cost=0.00..20403.18 rows=1 width=138) (actual time=12.751..22885.439 rows=4906 loops=1)
                           Join Filter: (("outer".case_id)::text = ("inner".case_id)::text)
                           ->  Index Scan using name_speed on identity  (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.142..52.538 rows=4915 loops=1)
                                 Index Cond: (((full_name)::text >= 'MILLER'::character varying) AND ((full_name)::text < 'MILLES'::character varying))
                                 Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text))
                           ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..3.96 rows=1 width=81) (actual time=4.631..4.635 rows=1 loops=4915)
                                 Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text)
                                 Filter: ('IL081025J'::text = (court_ori)::text)
                     ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=33) (actual time=0.053..0.062 rows=1 loops=4906)
                           Filter: ('IL081025J'::text = (id)::text)
               ->  Index Scan using case_speed on case_data  (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 rows=1 loops=4906)
                     Index Cond: (('IL081025J'::text =
(case_data.court_ori)::text) AND ((case_data.case_id)::text = ("outer".case_id)::text)) Total runtime: 38359.722 ms (18 rows) copa=> explain analyze select distinct case_category,identity_id,court.name,litigant_details.case_id,case_year,date_of_birth,assigned_case_role,litigant_details.court_ori,full_name,litigant_details.actor_id,case_data.type_code,case_data.subtype_code,litigant_details.impound_litigant_data, to_number(trim(leadingcase_data.type_code from trim(leading case_data.case_year from case_data.case_id)),'999999') as seqfrom identity,court,litigant_details,case_data where identity.court_ori = litigant_details.court_ori and identity.case_id= litigant_details.case_id and identity.actor_id = litigant_details.actor_id and court.id = identity.court_oriand identity.court_ori = case_data.court_ori and case_data.case_id = identity.case_id and identity.court_ori= 'IL081025J' and full_name like 'MILLER%' order by full_name; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=20411.84..20411.91 rows=2 width=173) (actual time=666.832..688.081 rows=4906 loops=1) -> Sort (cost=20411.84..20411.84 rows=2 width=173) (actual time=666.825..671.833 rows=4906 loops=1) Sort Key: identity.full_name, case_data.case_category, identity.identity_id, court.name, litigant_details.case_id,case_data.case_year, identity.date_of_birth, litigant_details.assigned_case_role, litigant_details.court_ori,litigant_details.actor_id, case_data.type_code, case_data.subtype_code, litigant_details.impound_litigant_data,to_number(ltrim(ltrim((case_data.case_id)::text, (case_data.case_year)::text), (case_data.type_code)::text),'999999'::text) -> Nested Loop (cost=0.00..20411.83 rows=2 width=173) (actual time=0.216..641.366 rows=4906 loops=1) -> Nested Loop (cost=0.00..20406.48 rows=1 width=159) (actual time=0.149..477.063 rows=4906 loops=1) -> Nested Loop (cost=0.00..20403.18 rows=1 width=138) (actual time=0.084..161.045 rows=4906 loops=1) Join Filter: (("outer".case_id)::text = ("inner".case_id)::text) -> Index Scan using name_speed on identity (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.047..37.898rows=4915 loops=1) Index Cond: (((full_name)::text >= 'MILLER'::character varying) AND ((full_name)::text <'MILLES'::character varying)) Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text)) -> Index Scan using lit_actor_speed on litigant_details (cost=0.00..3.96 rows=1 width=81) (actualtime=0.015..0.017 rows=1 loops=4915) Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text) Filter: ('IL081025J'::text = (court_ori)::text) -> Seq Scan on court (cost=0.00..3.29 rows=1 width=33) (actual time=0.049..0.056 rows=1 loops=4906) Filter: ('IL081025J'::text = (id)::text) -> Index Scan using case_speed on case_data (cost=0.00..5.29 rows=3 width=53) (actual time=0.017..0.020rows=1 loops=4906) Index Cond: (('IL081025J'::text = (case_data.court_ori)::text) AND ((case_data.case_id)::text = ("outer".case_id)::text)) Total runtime: 694.639 ms (18 rows) On Thu, 2005-08-18 at 09:00 -0400, Jeff Trout wrote: > On Aug 17, 2005, at 10:11 PM, 
Jeremiah Jahn wrote: > > > I just put together a system with 6GB of ram on a 14 disk raid 10 > > array. > > When I run my usual big painful queries, I get very little to know > > memory usage. My production box (raid 5 4GB ram) hovers at 3.9GB used > > most of the time. the new devel box sits at around 250MB. > > > > Is the system performing fine? Are you touching as much data as the > production box? > > If the system is performing fine don't worry about it. > > > work_mem = 2097151 # min 64, size in KB > > This is EXTREMELY high. You realize this is the amount of memory > that can be used per-sort and per-hash build in a query? You can end > up with multiples of this on a single query. If you have some big > queries that are run infrequently have them set it manually. > > > effective_cache_size = 3600000 <-----this is a little out of > > control, but would it have any real effect? > > This doesn't allocate anything - it is a hint to the planner about > how much data it can assume is cached. > > -- > Jeff Trout <jeff@jefftrout.com> > http://www.jefftrout.com/ > http://www.stuarthamm.net/ > > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match -- "Now this is a totally brain damaged algorithm. Gag me with a smurfette." -- P. Buhr, Computer Science 354
Jeremiah Jahn wrote: >here's an example standard query. Ireally have to make the first hit go >faster. The table is clustered as well on full_name as well. 'Smith%' >took 87 seconds on the first hit. I wonder if I set up may array wrong. >I remeber see something about DMA access versus something else, and >choose DMA access. LVM maybe? > > It would be nice if you would format your queries to be a little bit easier to read before posting them. However, I believe I am reading it correctly, to say that the index scan on identity is not your slow point. In fact, as near as I can tell, it only takes 52ms to complete. The expensive parts are the 4915 lookups into the litigant_details (each one takes approx 4ms for a total of ~20s). And then you do it again on case_data (average 3ms each * 4906 loops = ~15s). So there is no need for preloading your indexes on the identity table. It is definitely not the bottleneck. So a few design bits, which may help your database. Why is "actor_id" a text field instead of a number? You could try creating an index on "litigant_details (actor_id, count_ori)" so that it can do just an index lookup, rather than an index + filter. More importantly, though, the planner seems to think the join of identity to litigant_details will only return 1 row, not 5000. Do you regularly vacuum analyze your tables? Just as a test, try running: set enable_nested_loop to off; And then run EXPLAIN ANALYZE again, just to see if it is faster. You probably need to increase some statistics targets, so that the planner can design better plans. > -> Nested Loop (cost=0.00..20411.83 rows=2 width=173) > (actual time=12.891..38317.017 rows=4906 loops=1) > -> Nested Loop (cost=0.00..20406.48 rows=1 width=159) > (actual time=12.826..23232.106 rows=4906 loops=1) > -> Nested Loop (cost=0.00..20403.18 rows=1 > width=138) (actual time=12.751..22885.439 rows=4906 loops=1) > Join Filter: (("outer".case_id)::text = > ("inner".case_id)::text) > -> Index Scan using name_speed on > identity (cost=0.00..1042.34 rows=4868 width=82) (actual > time=0.142..52.538 rows=4915 loops=1) > Index Cond: (((full_name)::text >= > 'MILLER'::character varying) AND ((full_name)::text < > 'MILLES'::character varying)) > Filter: (((court_ori)::text = > 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text)) > -> Index Scan using lit_actor_speed on > litigant_details (cost=0.00..3.96 rows=1 width=81) (actual > time=4.631..4.635 rows=1 loops=4915) > Index Cond: (("outer".actor_id)::text > = (litigant_details.actor_id)::text) > Filter: ('IL081025J'::text = > (court_ori)::text) > -> Seq Scan on court (cost=0.00..3.29 rows=1 > width=33) (actual time=0.053..0.062 rows=1 loops=4906) > Filter: ('IL081025J'::text = (id)::text) > -> Index Scan using case_speed on case_data > (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 rows=1 > loops=4906) > Index Cond: (('IL081025J'::text = > (case_data.court_ori)::text) AND ((case_data.case_id)::text = > ("outer".case_id)::text)) John =:->
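Spelled out as SQL, John's suggestions look roughly like this. The index name is made up for the example, "count_ori" above is presumably the court_ori column used in the query, the statistics target of 100 is only a starting point, and note that the planner setting is actually spelled enable_nestloop:

    -- Composite index so the lookup needs no separate filter step
    CREATE INDEX lit_actor_court ON litigant_details (actor_id, court_ori);

    -- Better row estimates for the join/filter columns
    ALTER TABLE litigant_details ALTER COLUMN actor_id SET STATISTICS 100;
    ALTER TABLE identity ALTER COLUMN full_name SET STATISTICS 100;
    ANALYZE litigant_details;
    ANALYZE identity;

    -- One-off experiment: does a non-nested-loop plan win?
    SET enable_nestloop TO off;
    EXPLAIN ANALYZE SELECT ...;        -- the MILLER query from above
    SET enable_nestloop TO on;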
At 01:55 PM 8/18/2005, John Arbash Meinel wrote:
>Jeremiah Jahn wrote:
>
> >here's an example standard query. Ireally have to make the first hit go
> >faster. The table is clustered as well on full_name as well. 'Smith%'
> >took 87 seconds on the first hit. I wonder if I set up may array wrong.
> >I remeber see something about DMA access versus something else, and
> >choose DMA access. LVM maybe?
> >
>
>It would be nice if you would format your queries to be a little bit
>easier to read before posting them.
>However, I believe I am reading it correctly, to say that the index scan
>on identity is not your slow point. In fact, as near as I can tell, it
>only takes 52ms to complete.
>
>The expensive parts are the 4915 lookups into the litigant_details (each
>one takes approx 4ms for a total of ~20s).
>And then you do it again on case_data (average 3ms each * 4906 loops =
>~15s).

How big are litigant_details and case_data? If they can fit in RAM,
preload them using methods like the "cat to /dev/null" trick and those
table lookups will be ~100-1000x faster.

If they won't fit into RAM but the machine can be expanded to hold enough
RAM to fit the tables, it's well worth the ~$75-$150/GB to upgrade the
server so that the tables will fit into RAM.

If they can't be made to fit into RAM as atomic entities, you have a few
choices:

A= Put the data tables and indexes on separate dedicated spindles and put
litigant_details and case_data each on their own dedicated spindles. This
will lower seek conflicts. Again, if this requires buying some more HDs,
it's well worth it.

B= Break litigant_details and case_data into a set of smaller tables
(based on something sane like the first n characters of the primary key)
such that the smaller tables easily fit into RAM. Given that you've said
only 10GB/60GB is "hot", this could work very well. Combine it with "A"
above (put all the litigant_details sub tables on one dedicated spindle
set and all the case_data sub tables on another spindle set) for added
oomph.

C= Buy a SSD big enough to hold litigant_details and case_data and put
them there. Again, this can be combined with "A" and "B" above to lessen
the size of the SSD needed.

>So there is no need for preloading your indexes on the identity
>table. It is definitely not the bottleneck.
>
>So a few design bits, which may help your database. Why is
>"actor_id" a text field instead of a number?
>You could try creating an index on "litigant_details (actor_id,
>count_ori)" so that it can do just an index lookup, rather than an
>index + filter.

Yes, that certainly sounds like it would be more efficient.

>More importantly, though, the planner seems to think the join of
>identity to litigant_details will only return 1 row, not 5000.
>Do you regularly vacuum analyze your tables?
>Just as a test, try running:
>set enable_nested_loop to off;
>And then run EXPLAIN ANALYZE again, just to see if it is faster.
>
>You probably need to increase some statistics targets, so that the
>planner can design better plans.
> > > -> Nested Loop (cost=0.00..20411.83 rows=2 width=173) (actual > time=12.891..38317.017 rows=4906 loops=1) > > -> Nested Loop (cost=0.00..20406.48 rows=1 width=159)(actual > time=12.826..23232.106 rows=4906 loops=1) > > -> Nested Loop (cost=0.00..20403.18 rows=1 width=138) > (actual time=12.751..22885.439 rows=4906 loops=1) > > Join Filter: (("outer".case_id)::text = > ("inner".case_id)::text) > > -> Index Scan using name_speed on > identity (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.142..52.538 > > rows=4915 loops=1) > > Index Cond: (((full_name)::text >= > 'MILLER'::character varying) AND ((full_name)::text < > 'MILLES'::character varying)) > > Filter: (((court_ori)::text = > 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text)) > > -> Index Scan using lit_actor_speed on > litigant_details (cost=0.00..3.96 rows=1 width=81) (actual > > time=4.631..4.635 rows=1 loops=4915) > > Index Cond: (("outer".actor_id)::text = > (litigant_details.actor_id)::text) > > Filter: ('IL081025J'::text = (court_ori)::text) > > -> Seq Scan on court (cost=0.00..3.29 > rows=1 width=33) (actual time=0.053..0.062 rows=1 loops=4906) > > Filter: ('IL081025J'::text = (id)::text) > > -> Index Scan using case_speed on > case_data (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 > > rows=1 loops=4906) > > Index Cond: (('IL081025J'::text > = (case_data.court_ori)::text) AND ((case_data.case_id)::text = > > ("outer".case_id)::text))
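Ron's option "B" could be prototyped with table inheritance, which the 8.0-era servers in this thread already support. The child table, the column chosen for the split, and the prefix rule are purely illustrative, and without constraint exclusion (which arrived later, in 8.1) queries must name the right child table explicitly to get any I/O benefit:

    -- Children hold slices of the key space; each child needs its own indexes.
    CREATE TABLE litigant_details_i (
        CHECK (court_ori >= 'I' AND court_ori < 'J')
    ) INHERITS (litigant_details);
    CREATE INDEX lit_actor_speed_i ON litigant_details_i (actor_id);

    -- Move the matching rows into the child
    INSERT INTO litigant_details_i
        SELECT * FROM ONLY litigant_details
         WHERE court_ori >= 'I' AND court_ori < 'J';
    DELETE FROM ONLY litigant_details
     WHERE court_ori >= 'I' AND court_ori < 'J';

    -- A query that names the child directly touches a much smaller table/index
    SELECT * FROM litigant_details_i WHERE actor_id = '...';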
Sorry about the formatting. On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote: > Jeremiah Jahn wrote: > > >here's an example standard query. Ireally have to make the first hit go > >faster. The table is clustered as well on full_name as well. 'Smith%' > >took 87 seconds on the first hit. I wonder if I set up may array wrong. > >I remeber see something about DMA access versus something else, and > >choose DMA access. LVM maybe? > > > > > It would be nice if you would format your queries to be a little bit > easier to read before posting them. > However, I believe I am reading it correctly, to say that the index scan > on identity is not your slow point. In fact, as near as I can tell, it > only takes 52ms to complete. > > The expensive parts are the 4915 lookups into the litigant_details (each > one takes approx 4ms for a total of ~20s). > And then you do it again on case_data (average 3ms each * 4906 loops = > ~15s). Is there some way to avoid this? > > So there is no need for preloading your indexes on the identity table. > It is definitely not the bottleneck. > > So a few design bits, which may help your database. > Why is "actor_id" a text field instead of a number? This is simply due to the nature of the data. > You could try creating an index on "litigant_details (actor_id, > count_ori)" so that it can do just an index lookup, rather than an index > + filter. I have one, but it doesn't seem to like to use it. Don't really need it though, I can just drop the court_id out of the query. It's redundant, since each actor_id is also unique in litigant details. I had run vac full and analyze but I ran them again anyway and the planning improved. However, my 14 disk raid 10 array is still slower than my 3 disk raid 5 on my production box. 46sec vs 30sec (with live traffic on the production) One of the strange things is that when I run the cat command on my index and tables that are "HOT" it has no effect on memory usage. Right now I'm running ext3 on LVM. I'm still in a position to redo the file system and everything. Is this a good way to do it or should I switch to something else? What about stripe and extent sizes...? kernel parameters to change? 
---------------devel box:----------------------- copa=# EXPLAIN ANALYZE select full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data copa-# from identity copa-# join litigant_details on identity.actor_id = litigant_details.actor_id copa-# join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori = case_data.court_ori copa-# join court on identity.court_ori = court.id copa-# where identity.court_ori = 'IL081025J' and full_name like 'JONES%' order by full_name; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=3.29..29482.22 rows=3930 width=86) (actual time=114.060..46001.480 rows=5052 loops=1) -> Nested Loop (cost=3.29..16193.27 rows=3820 width=112) (actual time=93.038..24584.275 rows=5052 loops=1) -> Nested Loop (cost=0.00..16113.58 rows=3820 width=113) (actual time=85.778..24536.489 rows=5052 loops=1) -> Index Scan using name_speed on identity (cost=0.00..824.72 rows=3849 width=82) (actual time=50.284..150.133rows=5057 loops=1) Index Cond: (((full_name)::text >= 'JONES'::character varying) AND ((full_name)::text < 'JONET'::charactervarying)) Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'JONES%'::text)) -> Index Scan using lit_actor_speed on litigant_details (cost=0.00..3.96 rows=1 width=81) (actual time=4.788..4.812rows=1 loops=5057) Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text) -> Materialize (cost=3.29..3.30 rows=1 width=12) (actual time=0.002..0.003 rows=1 loops=5052) -> Seq Scan on court (cost=0.00..3.29 rows=1 width=12) (actual time=7.248..7.257 rows=1 loops=1) Filter: ('IL081025J'::text = (id)::text) -> Index Scan using case_speed on case_data (cost=0.00..3.46 rows=1 width=26) (actual time=4.222..4.230 rows=1 loops=5052) Index Cond: ((("outer".court_ori)::text = (case_data.court_ori)::text) AND (("outer".case_id)::text = (case_data.case_id)::text)) Total runtime: 46005.994 ms > > More importantly, though, the planner seems to think the join of > identity to litigant_details will only return 1 row, not 5000. > Do you regularly vacuum analyze your tables? > Just as a test, try running: > set enable_nested_loop to off; not quite acceptable Total runtime: 221486.149 ms > And then run EXPLAIN ANALYZE again, just to see if it is faster. > > You probably need to increase some statistics targets, so that the > planner can design better plans. 
---------------------this is the output from the production box------------------ LOG: duration: 27213.068 ms statement: EXPLAIN ANALYZE select full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data from identity join litigant_details on identity.actor_id = litigant_details.actor_id join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori = case_data.court_ori join court on identity.court_ori = court.id where identity.court_ori = 'IL081025J' and full_name like 'JONES%' order by full_name; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=3.29..43498.76 rows=2648 width=86) (actual time=17.106..27192.000 rows=5052 loops=1) -> Nested Loop (cost=0.00..43442.53 rows=2647 width=87) (actual time=16.947..27120.619 rows=5052 loops=1) -> Nested Loop (cost=0.00..23061.79 rows=3827 width=113) (actual time=16.801..17390.682 rows=5052 loops=1) -> Index Scan using name_speed on identity (cost=0.00..1277.39 rows=3858 width=82) (actual time=9.842..213.424rows=5057 loops=1) Index Cond: (((full_name)::text >= 'JONES'::character varying) AND ((full_name)::text < 'JONET'::charactervarying)) Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'JONES%'::text)) -> Index Scan using lit_actor_speed on litigant_details (cost=0.00..5.63 rows=1 width=81) (actual time=3.355..3.364rows=1 loops=5057) Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text) -> Index Scan using case_data_pkey on case_data (cost=0.00..5.31 rows=1 width=26) (actual time=1.897..1.904 rows=1loops=5052) Index Cond: ((("outer".court_ori)::text = (case_data.court_ori)::text) AND (("outer".case_id)::text = (case_data.case_id)::text)) -> Materialize (cost=3.29..3.30 rows=1 width=12) (actual time=0.002..0.003 rows=1 loops=5052) -> Seq Scan on court (cost=0.00..3.29 rows=1 width=12) (actual time=0.142..0.165 rows=1 loops=1) Filter: ('IL081025J'::text = (id)::text) Total runtime: 27205.060 ms > > > John > =:-> > -- "I didn't know it was impossible when I did it."
Jeremiah Jahn wrote: > Sorry about the formatting. > > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote: > >>Jeremiah Jahn wrote: >> >> ... >>The expensive parts are the 4915 lookups into the litigant_details (each >>one takes approx 4ms for a total of ~20s). >>And then you do it again on case_data (average 3ms each * 4906 loops = >>~15s). > > Is there some way to avoid this? > Well, in general, 3ms for a single lookup seems really long. Maybe your index is bloated by not vacuuming often enough. Do you tend to get a lot of updates to litigant_details? There are a couple possibilities at this point. First, you can REINDEX the appropriate index, and see if that helps. However, if this is a test box, it sounds like you just did a dump and reload, which wouldn't have bloat in an index. Another possibility. Is this the column that you usually use when pulling information out of litigant_details? If so, you can CLUSTER litigant_details on the appropriate index. This will help things be close together that should be, which decreases the index lookup costs. However, if this is not the common column, then you probably will slow down whatever other accesses you may have on this table. After CLUSTER, the current data will stay clustered, but new data will not, so you have to continually CLUSTER, the same way that you might VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as expensive as a VACUUM FULL. Be aware of this, but it might vastly improve your performance, so it would be worth it. > > >>So there is no need for preloading your indexes on the identity table. >>It is definitely not the bottleneck. >> >>So a few design bits, which may help your database. >>Why is "actor_id" a text field instead of a number? > > This is simply due to the nature of the data. > I'm just wondering if changing into a number, and using a number->name lookup would be faster for you. It may not be. In general, I prefer to use numbers for references. I may be over paranoid, but I know that some locales are bad with string -> string comparisons. And since the data in your database is stored as UNICODE, I'm not sure if it has to do any translating or not. Again, something to consider, it may not make any difference. > >>You could try creating an index on "litigant_details (actor_id, >>count_ori)" so that it can do just an index lookup, rather than an index >>+ filter. > > I have one, but it doesn't seem to like to use it. Don't really need it > though, I can just drop the court_id out of the query. It's redundant, > since each actor_id is also unique in litigant details. I had run vac > full and analyze but I ran them again anyway and the planning improved. > However, my 14 disk raid 10 array is still slower than my 3 disk raid 5 > on my production box. 46sec vs 30sec (with live traffic on the > production) One of the strange things is that when I run the cat command > on my index and tables that are "HOT" it has no effect on memory usage. > Right now I'm running ext3 on LVM. I'm still in a position to redo the > file system and everything. Is this a good way to do it or should I > switch to something else? What about stripe and extent sizes...? kernel > parameters to change? Well, the plans are virtually identical. There is one small difference as to whether it joins against case_data or court first. 
But 'court' is very tiny (small enough to use a seqscan instead of index scan) I'm a little surprised with court being this small that it doesn't do something like a hash aggregation, but court takes no time anyway. The real problem is that your nested loop index time is *much* slower. Devel: -> Index Scan using lit_actor_speed on litigant_details (cost=0.00..3.96 rows=1 width=81) (actual time=4.788..4.812 rows=1 loops=5057) Production: -> Index Scan using lit_actor_speed on litigant_details (cost=0.00..5.63 rows=1 width=81) (actual time=3.355..3.364 rows=1 loops=5057) Devel: -> Index Scan using case_speed on case_data (cost=0.00..3.46 rows=1 width=26) (actual time=4.222..4.230 rows=1 loops=5052) Production: -> Index Scan using case_data_pkey on case_data (cost=0.00..5.31 rows=1 width=26) (actual time=1.897..1.904 rows=1 loops=5052) Notice that the actual per-row cost is as much as 1/2 less than on your devel box. As a test, can you do "time cat $index_file >/dev/null" a couple of times. And then determine the MB/s. Alternatively run vmstat in another shell. If the read/s doesn't change, then you know the "cat" is being served from RAM, and thus it really is cached. I can point you to REINDEX and CLUSTER, but if it is caching in ram, I honestly can't say why the per loop would be that much slower. Are both systems running the same postgres version? It sounds like it is different (since you say something about switching to 8.0). I doubt it, but you might try an 8.1devel version. ... >>Do you regularly vacuum analyze your tables? >>Just as a test, try running: >>set enable_nested_loop to off; > > not quite acceptable > Total runtime: 221486.149 ms > Well, the estimates are now at least closer (3k vs 5k instead of 1), and it is still choosing nested loops. So they probably are faster. I would still be interested in the actual EXPLAIN ANALYZE with nested loops disabled. It is possible that *some* of the nested loops are performing worse than they have to. But really, you have worse index speed, and that needs to be figured out. John =:->
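The cache check John describes, as concrete commands; the path and file name are placeholders for wherever the index's relfilenode actually lives:

    # First run may pull the index file from disk; the second should be served from RAM.
    time cat /var/lib/pgsql/data/base/<db_oid>/<index_relfilenode> > /dev/null
    time cat /var/lib/pgsql/data/base/<db_oid>/<index_relfilenode> > /dev/null
    # MB/s = (file size in MB) / (elapsed seconds reported by time)

    # In another shell, watch disk activity once per second. If the "bi"
    # (blocks read in) column stays near zero during the second cat, the
    # file really is coming from the page cache.
    vmstat 1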
At 01:18 PM 8/19/2005, John A Meinel wrote: >Jeremiah Jahn wrote: > > Sorry about the formatting. > > > > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote: > > > >>Jeremiah Jahn wrote: > >> > >> > >... > > >>The expensive parts are the 4915 lookups into the litigant_details (each > >>one takes approx 4ms for a total of ~20s). > >>And then you do it again on case_data (average 3ms each * 4906 loops = > >>~15s). > > > > Is there some way to avoid this? > > > >Well, in general, 3ms for a single lookup seems really long. Maybe your >index is bloated by not vacuuming often enough. Do you tend to get a lot >of updates to litigant_details? Given that the average access time for a 15Krpm HD is in the 5.5-6ms range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case) table accesses is requiring a seek. This implies a poor match between physical layout and access pattern. If I understand correctly, the table should not be very fragmented given that this is a reasonably freshly loaded DB? That implies that the fields being looked up are not well sorted in the table compared to the query pattern. If the entire table could fit in RAM, this would be far less of a consideration. Failing that, the physical HD layout has to be improved or the query pattern has to be changed to reduce seeks. >There are a couple possibilities at this point. First, you can REINDEX >the appropriate index, and see if that helps. However, if this is a test >box, it sounds like you just did a dump and reload, which wouldn't have >bloat in an index. > >Another possibility. Is this the column that you usually use when >pulling information out of litigant_details? If so, you can CLUSTER >litigant_details on the appropriate index. This will help things be >close together that should be, which decreases the index lookup costs. > >However, if this is not the common column, then you probably will slow >down whatever other accesses you may have on this table. > >After CLUSTER, the current data will stay clustered, but new data will >not, so you have to continually CLUSTER, the same way that you might >VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as >expensive as a VACUUM FULL. Be aware of this, but it might vastly >improve your performance, so it would be worth it. CLUSTER can be a very large maintenance overhead/problem if the table(s) in question actually need to be "continually" re CLUSTER ed. If there is no better solution available, then you do what you have to, but it feels like there should be a better answer here. Perhaps the DB schema needs examining to see if it matches up well with its real usage? Ron Peacetree
Ron wrote: > At 01:18 PM 8/19/2005, John A Meinel wrote: > >> Jeremiah Jahn wrote: >> > Sorry about the formatting. >> > >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote: >> > >> >>Jeremiah Jahn wrote: >> >> >> >> >> >> ... >> >> >>The expensive parts are the 4915 lookups into the litigant_details >> (each >> >>one takes approx 4ms for a total of ~20s). >> >>And then you do it again on case_data (average 3ms each * 4906 loops = >> >>~15s). >> > >> > Is there some way to avoid this? >> > >> >> Well, in general, 3ms for a single lookup seems really long. Maybe your >> index is bloated by not vacuuming often enough. Do you tend to get a lot >> of updates to litigant_details? > > > Given that the average access time for a 15Krpm HD is in the 5.5-6ms > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case) > table accesses is requiring a seek. > Well, from what he has said, the total indexes are < 1GB and he has 6GB of ram. So everything should fit. Not to mention he is only accessing 5000/several million rows. > This implies a poor match between physical layout and access pattern. This seems to be the case. But since this is not the only query, it may be that other access patterns are more important to optimize for. > > If I understand correctly, the table should not be very fragmented given > that this is a reasonably freshly loaded DB? That implies that the > fields being looked up are not well sorted in the table compared to the > query pattern. > > If the entire table could fit in RAM, this would be far less of a > consideration. Failing that, the physical HD layout has to be improved > or the query pattern has to be changed to reduce seeks. > > ... >> After CLUSTER, the current data will stay clustered, but new data will >> not, so you have to continually CLUSTER, the same way that you might >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as >> expensive as a VACUUM FULL. Be aware of this, but it might vastly >> improve your performance, so it would be worth it. > > > CLUSTER can be a very large maintenance overhead/problem if the table(s) > in question actually need to be "continually" re CLUSTER ed. > > If there is no better solution available, then you do what you have to, > but it feels like there should be a better answer here. > > Perhaps the DB schema needs examining to see if it matches up well with > its real usage? > > Ron Peacetree > I certainly agree that CLUSTER is expensive, and is an on-going maintenance issue. If it is the normal access pattern, though, it may be worth it. I also wonder, though, if his table is properly normalized. Which, as you mentioned, might lead to improved access patterns. John =:->
On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote: > Jeremiah Jahn wrote: > > Sorry about the formatting. > > > > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote: > > > >>Jeremiah Jahn wrote: > >> > >> > > ... > > >>The expensive parts are the 4915 lookups into the litigant_details (each > >>one takes approx 4ms for a total of ~20s). > >>And then you do it again on case_data (average 3ms each * 4906 loops = > >>~15s). > > > > Is there some way to avoid this? > > > > Well, in general, 3ms for a single lookup seems really long. Maybe your > index is bloated by not vacuuming often enough. Do you tend to get a lot > of updates to litigant_details? I have vacuumed this already. I get lots of updates, but this data is mostly unchanging. > > There are a couple possibilities at this point. First, you can REINDEX > the appropriate index, and see if that helps. However, if this is a test > box, it sounds like you just did a dump and reload, which wouldn't have > bloat in an index. I loaded it using slony > > Another possibility. Is this the column that you usually use when > pulling information out of litigant_details? If so, you can CLUSTER > litigant_details on the appropriate index. This will help things be > close together that should be, which decreases the index lookup costs. clustering on this right now. Most of the other things are already clustered. name and case_data > > However, if this is not the common column, then you probably will slow > down whatever other accesses you may have on this table. > > After CLUSTER, the current data will stay clustered, but new data will > not, so you have to continually CLUSTER, the same way that you might > VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as > expensive as a VACUUM FULL. Be aware of this, but it might vastly > improve your performance, so it would be worth it. I generally re-cluster once a week. > > > > > > >>So there is no need for preloading your indexes on the identity table. > >>It is definitely not the bottleneck. > >> > >>So a few design bits, which may help your database. > >>Why is "actor_id" a text field instead of a number? > > > > This is simply due to the nature of the data. > > > > I'm just wondering if changing into a number, and using a number->name > lookup would be faster for you. It may not be. In general, I prefer to > use numbers for references. I may be over paranoid, but I know that some > locales are bad with string -> string comparisons. And since the data in > your database is stored as UNICODE, I'm not sure if it has to do any > translating or not. Again, something to consider, it may not make any > difference. I don't believe so. I initialze the DB as 'lang=C'. I used to have the problem where things were being inited as en_US. this would prevent any text based index from working. This doesn't seem to be the case here, so I'm not worried about it. > > > > > >>You could try creating an index on "litigant_details (actor_id, > >>count_ori)" so that it can do just an index lookup, rather than an index > >>+ filter. > > > > I have one, but it doesn't seem to like to use it. Don't really need it > > though, I can just drop the court_id out of the query. It's redundant, > > since each actor_id is also unique in litigant details. I had run vac > > full and analyze but I ran them again anyway and the planning improved. > > However, my 14 disk raid 10 array is still slower than my 3 disk raid 5 > > on my production box. 
46sec vs 30sec (with live traffic on the > > production) One of the strange things is that when I run the cat command > > on my index and tables that are "HOT" it has no effect on memory usage. > > Right now I'm running ext3 on LVM. I'm still in a position to redo the > > file system and everything. Is this a good way to do it or should I > > switch to something else? What about stripe and extent sizes...? kernel > > parameters to change? > > Well, the plans are virtually identical. There is one small difference > as to whether it joins against case_data or court first. But 'court' is > very tiny (small enough to use a seqscan instead of index scan) I'm a > little surprised with court being this small that it doesn't do > something like a hash aggregation, but court takes no time anyway. > > The real problem is that your nested loop index time is *much* slower. > > Devel: > -> Index Scan using lit_actor_speed on litigant_details > (cost=0.00..3.96 rows=1 width=81) > (actual time=4.788..4.812 rows=1 loops=5057) > > Production: > -> Index Scan using lit_actor_speed on litigant_details > (cost=0.00..5.63 rows=1 width=81) > (actual time=3.355..3.364 rows=1 loops=5057) > > Devel: > -> Index Scan using case_speed on case_data > (cost=0.00..3.46 rows=1 width=26) > (actual time=4.222..4.230 rows=1 loops=5052) > > Production: > -> Index Scan using case_data_pkey on case_data > (cost=0.00..5.31 rows=1 width=26) > (actual time=1.897..1.904 rows=1 loops=5052) > > Notice that the actual per-row cost is as much as 1/2 less than on your > devel box. > > As a test, can you do "time cat $index_file >/dev/null" a couple of > times. And then determine the MB/s. > Alternatively run vmstat in another shell. If the read/s doesn't change, > then you know the "cat" is being served from RAM, and thus it really is > cached. it's cached alright. I'm getting a read rate of about 150MB/sec. I would have thought is would be faster with my raid setup. I think I'm going to scrap the whole thing and get rid of LVM. I'll just do a straight ext3 system. Maybe that will help. Still trying to get suggestions for a stripe size. > > I can point you to REINDEX and CLUSTER, but if it is caching in ram, I > honestly can't say why the per loop would be that much slower. > Are both systems running the same postgres version? It sounds like it is > different (since you say something about switching to 8.0). These had little or no effect. The production machine is running 7.4 while the devel machine is running 8.0 > I doubt it, but you might try an 8.1devel version. > > ... > > >>Do you regularly vacuum analyze your tables? > >>Just as a test, try running: > >>set enable_nested_loop to off; > > > > not quite acceptable > > Total runtime: 221486.149 ms > > > > Well, the estimates are now at least closer (3k vs 5k instead of 1), and > it is still choosing nested loops. So they probably are faster. > I would still be interested in the actual EXPLAIN ANALYZE with nested > loops disabled. It is possible that *some* of the nested loops are > performing worse than they have to. this is a cached version. 
> copa=> explain analyze select full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data > copa-> from identity > copa-> join litigant_details on identity.actor_id = litigant_details.actor_id > copa-> join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori = case_data.court_ori > copa-> join court on identity.court_ori = court.id > copa-> where identity.court_ori = 'IL081025J' and full_name like 'SMITH%' order by full_name; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Sort (cost=100502560.72..100502583.47 rows=9099 width=86) (actual time=17843.876..17849.401 rows=8094 loops=1) > Sort Key: identity.full_name > -> Merge Join (cost=100311378.72..100501962.40 rows=9099 width=86) (actual time=15195.816..17817.847 rows=8094 loops=1) > Merge Cond: ((("outer".court_ori)::text = "inner"."?column10?") AND (("outer".case_id)::text = "inner"."?column11?")) > -> Index Scan using case_speed on case_data (cost=0.00..170424.73 rows=3999943 width=26) (actual time=0.015..4540.525rows=3018284 loops=1) > -> Sort (cost=100311378.72..100311400.82 rows=8839 width=112) (actual time=9594.985..9601.174 rows=8094 loops=1) > Sort Key: (litigant_details.court_ori)::text, (litigant_details.case_id)::text > -> Nested Loop (cost=100002491.43..100310799.34 rows=8839 width=112) (actual time=6892.755..9555.828 rows=8094loops=1) > -> Seq Scan on court (cost=0.00..3.29 rows=1 width=12) (actual time=0.085..0.096 rows=1 loops=1) > Filter: ('IL081025J'::text = (id)::text) > -> Merge Join (cost=2491.43..310707.66 rows=8839 width=113) (actual time=6892.656..9519.680 rows=8094loops=1) > Merge Cond: (("outer".actor_id)::text = "inner"."?column7?") > -> Index Scan using lit_actor_speed on litigant_details (cost=0.00..295722.00 rows=4956820width=81) (actual time=0.027..5613.814 rows=3736703 loops=1) > -> Sort (cost=2491.43..2513.71 rows=8913 width=82) (actual time=116.071..122.272 rows=8100loops=1) > Sort Key: (identity.actor_id)::text > -> Index Scan using name_speed on identity (cost=0.00..1906.66 rows=8913 width=82) (actualtime=0.133..81.104 rows=8100 loops=1) > Index Cond: (((full_name)::text >= 'SMITH'::character varying) AND ((full_name)::text< 'SMITI'::character varying)) > Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'SMITH%'::text)) > Total runtime: 17859.917 ms > But really, you have worse index speed, and that needs to be figured out. > > John > =:-> -- Speak softly and carry a +6 two-handed sword.
On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote: > Ron wrote: > > At 01:18 PM 8/19/2005, John A Meinel wrote: > > > >> Jeremiah Jahn wrote: > >> > Sorry about the formatting. > >> > > >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote: > >> > > >> >>Jeremiah Jahn wrote: > >> >> > >> >> > >> > >> ... > >> > >> >>The expensive parts are the 4915 lookups into the litigant_details > >> (each > >> >>one takes approx 4ms for a total of ~20s). > >> >>And then you do it again on case_data (average 3ms each * 4906 loops = > >> >>~15s). > >> > > >> > Is there some way to avoid this? > >> > > >> > >> Well, in general, 3ms for a single lookup seems really long. Maybe your > >> index is bloated by not vacuuming often enough. Do you tend to get a lot > >> of updates to litigant_details? > > > > > > Given that the average access time for a 15Krpm HD is in the 5.5-6ms > > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single > > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case) > > table accesses is requiring a seek. > > I think LVM may be a problem, since it also seems to break things up on the file system. My access time on the seek should be around 1/7th the 15Krpm I believe since it's a 14 disk raid 10 array. And no other traffic at the moment. > > > Well, from what he has said, the total indexes are < 1GB and he has 6GB > of ram. So everything should fit. Not to mention he is only accessing > 5000/several million rows. I table spaced some of the indexes and they are around 211066880 bytes for the name_speed index and 149825330 for the lit_actor_speed index tables seem to be about a gig. > > > > This implies a poor match between physical layout and access pattern. > > This seems to be the case. But since this is not the only query, it may > be that other access patterns are more important to optimize for. > > > > > If I understand correctly, the table should not be very fragmented given > > that this is a reasonably freshly loaded DB? That implies that the > > fields being looked up are not well sorted in the table compared to the > > query pattern. > > > > If the entire table could fit in RAM, this would be far less of a > > consideration. Failing that, the physical HD layout has to be improved > > or the query pattern has to be changed to reduce seeks. > > > > > > ... > > >> After CLUSTER, the current data will stay clustered, but new data will > >> not, so you have to continually CLUSTER, the same way that you might > >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as > >> expensive as a VACUUM FULL. Be aware of this, but it might vastly > >> improve your performance, so it would be worth it. > > > > > > CLUSTER can be a very large maintenance overhead/problem if the table(s) > > in question actually need to be "continually" re CLUSTER ed. > > > > If there is no better solution available, then you do what you have to, > > but it feels like there should be a better answer here. > > > > Perhaps the DB schema needs examining to see if it matches up well with > > its real usage? > > > > Ron Peacetree > > > > I certainly agree that CLUSTER is expensive, and is an on-going > maintenance issue. If it is the normal access pattern, though, it may be > worth it. The query I've sent you is one of the most common I get just change the name. I handle about 180K of them a day mostly between 8 and 5. The clustering has never really been a problem. Like I said before I do it about once a week. 
I handle about 3000 updates an hour consisting of about 1000-3000 statements per update, i.e. about 2.5 million updates per hour. In the last few months or so I've filtered these down to about 400K update/delete/insert statements per hour. > > I also wonder, though, if his table is properly normalized. Which, as > you mentioned, might lead to improved access patterns. The system is about as normalized as I can get it. In general the layout is the following: courts have cases, cases have litigant_details. Actors have identities and litigant_details. > > John > =:-> -- Speak softly and carry a +6 two-handed sword.
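A rough way to cross-check that kind of per-table write volume is the statistics collector, assuming row-level stats are enabled; this is only a sketch, not something run in the thread:

    SELECT relname, n_tup_ins, n_tup_upd, n_tup_del
      FROM pg_stat_user_tables
     ORDER BY n_tup_upd DESC
     LIMIT 10;

The counters are cumulative since the last stats reset, so sampling them an hour apart gives an updates-per-hour figure per table.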
Rebuild in progress with just ext3 on the raid array...will see if this helps the access times. If it doesn't I'll mess with the stripe size. I have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index > /dev/null' none of this seems to have helped, or even increased my memory usage. argh! The only thing about this new system that I'm unfamiliar with is the array setup and LVM, which is why I think that's where the issue is. clustering and indexing as well as vacuum etc are things that I do and have been aware of for sometime. Perhaps slony is a factor, but I really don't see it causing problems on index read speed esp. when it's not running. thanx for your help, I really appreciate it. -jj- On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote: > Ron wrote: > > At 01:18 PM 8/19/2005, John A Meinel wrote: > > > >> Jeremiah Jahn wrote: > >> > Sorry about the formatting. > >> > > >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote: > >> > > >> >>Jeremiah Jahn wrote: > >> >> > >> >> > >> > >> ... > >> > >> >>The expensive parts are the 4915 lookups into the litigant_details > >> (each > >> >>one takes approx 4ms for a total of ~20s). > >> >>And then you do it again on case_data (average 3ms each * 4906 loops = > >> >>~15s). > >> > > >> > Is there some way to avoid this? > >> > > >> > >> Well, in general, 3ms for a single lookup seems really long. Maybe your > >> index is bloated by not vacuuming often enough. Do you tend to get a lot > >> of updates to litigant_details? > > > > > > Given that the average access time for a 15Krpm HD is in the 5.5-6ms > > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single > > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case) > > table accesses is requiring a seek. > > > > > Well, from what he has said, the total indexes are < 1GB and he has 6GB > of ram. So everything should fit. Not to mention he is only accessing > 5000/several million rows. > > > > This implies a poor match between physical layout and access pattern. > > This seems to be the case. But since this is not the only query, it may > be that other access patterns are more important to optimize for. > > > > > If I understand correctly, the table should not be very fragmented given > > that this is a reasonably freshly loaded DB? That implies that the > > fields being looked up are not well sorted in the table compared to the > > query pattern. > > > > If the entire table could fit in RAM, this would be far less of a > > consideration. Failing that, the physical HD layout has to be improved > > or the query pattern has to be changed to reduce seeks. > > > > > > ... > > >> After CLUSTER, the current data will stay clustered, but new data will > >> not, so you have to continually CLUSTER, the same way that you might > >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as > >> expensive as a VACUUM FULL. Be aware of this, but it might vastly > >> improve your performance, so it would be worth it. > > > > > > CLUSTER can be a very large maintenance overhead/problem if the table(s) > > in question actually need to be "continually" re CLUSTER ed. > > > > If there is no better solution available, then you do what you have to, > > but it feels like there should be a better answer here. > > > > Perhaps the DB schema needs examining to see if it matches up well with > > its real usage? > > > > Ron Peacetree > > > > I certainly agree that CLUSTER is expensive, and is an on-going > maintenance issue. 
If it is the normal access pattern, though, it may be > worth it. > > I also wonder, though, if his table is properly normalized. Which, as > you mentioned, might lead to improved access patterns. > > John > =:-> -- Speak softly and carry a +6 two-handed sword.
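Since the tables already have clustering indexes assigned (the schema appears later in the thread), the weekly re-cluster can be as small as the following sketch; note the exclusive lock mentioned above:

    -- run in a quiet window: CLUSTER locks each table exclusively while rewriting it
    CLUSTER;   -- re-clusters every table that has a clustering index set
    ANALYZE;   -- refresh planner statistics afterwards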
Jeremiah Jahn wrote: > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote: > >>Jeremiah Jahn wrote: >> ... >> >>Well, in general, 3ms for a single lookup seems really long. Maybe your >>index is bloated by not vacuuming often enough. Do you tend to get a lot >>of updates to litigant_details? > > I have vacuumed this already. I get lots of updates, but this data is > mostly unchanging. > > >>There are a couple possibilities at this point. First, you can REINDEX >>the appropriate index, and see if that helps. However, if this is a test >>box, it sounds like you just did a dump and reload, which wouldn't have >>bloat in an index. > > > I loaded it using slony I don't know that slony versus pg_dump/pg_restore really matters. The big thing is that Updates wouldn't be trashing your index. But if you are saying that you cluster once/wk your index can't be that messed up anyway. (Unless CLUSTER messes up the non-clustered indexes, but that would make cluster much less useful, so I would have guessed this was not the case) > > >>Another possibility. Is this the column that you usually use when >>pulling information out of litigant_details? If so, you can CLUSTER >>litigant_details on the appropriate index. This will help things be >>close together that should be, which decreases the index lookup costs. > > clustering on this right now. Most of the other things are already > clustered. name and case_data Just as a reality check, they are clustered on the columns in question, right? (I don't know if this column is a primary key or not, but any index can be used for clustering). > > >>However, if this is not the common column, then you probably will slow >>down whatever other accesses you may have on this table. >> >>After CLUSTER, the current data will stay clustered, but new data will >>not, so you have to continually CLUSTER, the same way that you might >>VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as >>expensive as a VACUUM FULL. Be aware of this, but it might vastly >>improve your performance, so it would be worth it. > > I generally re-cluster once a week. > >>> >>>>So there is no need for preloading your indexes on the identity table. >>>>It is definitely not the bottleneck. >>>> >>>>So a few design bits, which may help your database. >>>>Why is "actor_id" a text field instead of a number? >>> >>>This is simply due to the nature of the data. >>> >> >>I'm just wondering if changing into a number, and using a number->name >>lookup would be faster for you. It may not be. In general, I prefer to >>use numbers for references. I may be over paranoid, but I know that some >>locales are bad with string -> string comparisons. And since the data in >>your database is stored as UNICODE, I'm not sure if it has to do any >>translating or not. Again, something to consider, it may not make any >>difference. > > I don't believe so. I initialze the DB as 'lang=C'. I used to have the > problem where things were being inited as en_US. this would prevent any > text based index from working. This doesn't seem to be the case here, so > I'm not worried about it. > Sorry, I think I was confusing you with someone else who posted SHOW ALL. > > >> ... > it's cached alright. I'm getting a read rate of about 150MB/sec. I would > have thought is would be faster with my raid setup. I think I'm going to > scrap the whole thing and get rid of LVM. I'll just do a straight ext3 > system. Maybe that will help. Still trying to get suggestions for a > stripe size. 
> I don't think 150MB/s is out of the realm for a 14 drive array. How fast is time dd if=/dev/zero of=testfile bs=8192 count=1000000 (That should create a 8GB file, which is too big to cache everything) And then how fast is: time dd if=testfile of=/dev/null bs=8192 count=1000000 That should give you a semi-decent way of measuring how fast the RAID system is, since it should be too big to cache in ram. > >>I can point you to REINDEX and CLUSTER, but if it is caching in ram, I >>honestly can't say why the per loop would be that much slower. >>Are both systems running the same postgres version? It sounds like it is >>different (since you say something about switching to 8.0). > > These had little or no effect. > The production machine is running 7.4 while the devel machine is running > 8.0 > Well, my concern is that maybe some portion of the 8.0 code actually slowed things down for you. You could try reverting to 7.4 on the devel box, though I think playing with upgrading to 8.1 might be more worthwhile. ... > > this is a cached version. > I assume that you mean this is the second run of the query. I can't compare it too much, since this is "smith" rather than "jones". But this one is 17s rather than the other one being 46s. And that includes having 8k rows instead of having 5k rows. Have you tried other values with disabled nested loops? Because this query (at least in cached form) seems to be *way* faster than with nested loops. I know that you somehow managed to get 200s in your testing, but it might just be that whatever needed to be loaded is now loaded, and you would get better performance. If this is true, it means you might need to tweak some settings, and make sure your statistics are decent, so that postgres can actually pick the optimal plan. > >>copa=> explain analyze select full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data >>copa-> from identity >>copa-> join litigant_details on identity.actor_id = litigant_details.actor_id >>copa-> join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori = case_data.court_ori >>copa-> join court on identity.court_ori = court.id >>copa-> where identity.court_ori = 'IL081025J' and full_name like 'SMITH%' order by full_name; >> QUERY PLAN >>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> Sort (cost=100502560.72..100502583.47 rows=9099 width=86) (actual time=17843.876..17849.401 rows=8094 loops=1) >> Sort Key: identity.full_name >> -> Merge Join (cost=100311378.72..100501962.40 rows=9099 width=86) (actual time=15195.816..17817.847 rows=8094 loops=1) >> Merge Cond: ((("outer".court_ori)::text = "inner"."?column10?") AND (("outer".case_id)::text = "inner"."?column11?")) >> -> Index Scan using case_speed on case_data (cost=0.00..170424.73 rows=3999943 width=26) (actual time=0.015..4540.525rows=3018284 loops=1) >> -> Sort (cost=100311378.72..100311400.82 rows=8839 width=112) (actual time=9594.985..9601.174 rows=8094 loops=1) >> Sort Key: (litigant_details.court_ori)::text, (litigant_details.case_id)::text >> -> Nested Loop (cost=100002491.43..100310799.34 rows=8839 width=112) (actual time=6892.755..9555.828 rows=8094loops=1) >> -> Seq Scan on court (cost=0.00..3.29 rows=1 width=12) (actual time=0.085..0.096 rows=1 loops=1) >> Filter: ('IL081025J'::text = (id)::text) What I don't really understand is the next part. 
It seems to be doing an index scan on 3.7M rows, and getting very decent performance (5s), and then merging against a table which returns only 8k rows. Why is it having to look through all of those rows? I may be missing something, but this says it is able to do 600 index lookups / millisecond. Which seems superfast. (Compared to your earlier 4ms / lookup) Something fishy is going on here. >> -> Merge Join (cost=2491.43..310707.66 rows=8839 width=113) (actual time=6892.656..9519.680 rows=8094loops=1) >> Merge Cond: (("outer".actor_id)::text = "inner"."?column7?") >> -> Index Scan using lit_actor_speed on litigant_details (cost=0.00..295722.00 rows=4956820width=81) (actual time=0.027..5613.814 rows=3736703 loops=1) >> -> Sort (cost=2491.43..2513.71 rows=8913 width=82) (actual time=116.071..122.272 rows=8100loops=1) >> Sort Key: (identity.actor_id)::text >> -> Index Scan using name_speed on identity (cost=0.00..1906.66 rows=8913 width=82) (actualtime=0.133..81.104 rows=8100 loops=1) >> Index Cond: (((full_name)::text >= 'SMITH'::character varying) AND ((full_name)::text< 'SMITI'::character varying)) >> Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'SMITH%'::text)) >> Total runtime: 17859.917 ms > > >>But really, you have worse index speed, and that needs to be figured out. >> >>John >>=:-> I'm assuming your data is private (since it looks like legal stuff). Unless maybe that makes it part of the public record. Anyway, I'm not able to, but sometimes someone like Tom can profile stuff to see what is going on. I might just be messing up my ability to read the explain output. But somehow things don't seem to be lining up with the cost of a single index lookup. On my crappy Celeron 450 box, an index lookup is 0.06ms once things are cached in ram. John =:->
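One way to pin down the per-lookup cost being debated here is to time a single probe in isolation; the literal below is a placeholder, and the second or third run shows the fully cached cost:

    EXPLAIN ANALYZE
    SELECT * FROM litigant_details
     WHERE actor_id = 'some-known-actor-id';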
Jeremiah Jahn wrote: > Rebuild in progress with just ext3 on the raid array...will see if this > helps the access times. If it doesn't I'll mess with the stripe size. I > have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index > >>/dev/null' none of this seems to have helped, or even increased my > > memory usage. argh! The only thing about this new system that I'm > unfamiliar with is the array setup and LVM, which is why I think that's > where the issue is. clustering and indexing as well as vacuum etc are > things that I do and have been aware of for sometime. Perhaps slony is a > factor, but I really don't see it causing problems on index read speed > esp. when it's not running. > > thanx for your help, I really appreciate it. > -jj- > By the way, how are you measuring memory usage? Can you give the output of that command, just to make sure you are reading it correctly. John =:->
On Aug 19, 2005, at 3:01 PM, Jeremiah Jahn wrote: > Rebuild in progress with just ext3 on the raid array...will see if > this > helps the access times. From my recent experiences, I can say ext3 is probably not a great choice for Pg databases. If you check the archives you'll see there's a lot of discussion about various journalling filesystems and ext3 usually(always?) comes up on the bottom as far as performance goes. If you insist on using it, I would at least recommend the noatime option in fstab and using data=writeback to get the faster of the journal modes. XFS seems to be a trusted choice, followed by Reiser and JFS both with the occasional controversy when the comparisons pop up. -Dan
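For anyone wanting to try those mount options, an /etc/fstab entry would look roughly like this; the device and mount point are placeholders, and data=writeback carries the integrity caveat discussed in the next message:

    # example only: substitute the real device and data directory
    /dev/sdb1   /var/lib/pgsql/data   ext3   noatime,data=writeback   0 0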
Dan Harris wrote: > From my recent experiences, I can say ext3 is probably not a great > choice for Pg databases. If you check the archives you'll see > there's a lot of discussion about various journalling filesystems and > ext3 usually(always?) comes up on the bottom as far as performance > goes. If you insist on using it, I would at least recommend the > noatime option in fstab and using data=writeback to get the faster Based on my knowledge, Ext3 is good with keeping filesystem integrity AND data integrity while pressing the reset button. However, by selecting data=writeback, you gain more speed, but you risk the data integrity during a crash: Ext3 guarantees only filesystem integrity. For database transaction logs this means the last transactions are not guaranteed to have been written to the hard drives during a hardware reset, i.e. some committed transactions may be lost. Reiserfs is known to behave in this unsafe way as well. Is there a way with a Reiserfs filesystem to fulfill both the filesystem AND data integrity requirements nowadays? See for example "man mount" for the effects of data=journal, data=ordered (the default) and data=writeback for Ext3. Only writeback risks data integrity. Ext3 is the only journaled filesystem I know of that fulfills these fundamental data integrity guarantees. Personally I like such filesystems, even though they mean less speed. Marko Ristola
On Sat, Aug 20, 2005 at 01:12:15AM -0600, Dan Harris wrote: >XFS seems to be a trusted choice, followed by Reiser and JFS both >with the occasional controversy when the comparisons pop up. And don't put the xlog on a journaled filesystem. There is no advantage to doing so, and it will slow things down. (Assuming a sane separate xlog partition configuration, sized reasonably.) Mike Stone
On Sat, Aug 20, 2005 at 02:17:54PM +0300, Marko Ristola wrote: >Based on my knoledge, Ext3 is good with keeping filesystem integrity >AND data integrity while pressing the reset button. However, by >selecting data=writeback, you gain more speed, but you risk the data >integrity during a crash: Ext3 garantees only filesystem integrity. That's why postgres keeps its own transaction log. Any of these filesystems guarantee data integrity for data that's been synced to disk, and postgres keeps track of what data has and has not been committed so it can recover gracefully from a failure. That's why most filesystems are designed the way they are; the application can determine what things need better data integrity and which need better performance on a case-by-case basis. Mike Stone
Michael Stone <mstone+postgres@mathom.us> writes: > On Sat, Aug 20, 2005 at 02:17:54PM +0300, Marko Ristola wrote: >> Based on my knoledge, Ext3 is good with keeping filesystem integrity >> AND data integrity while pressing the reset button. However, by >> selecting data=writeback, you gain more speed, but you risk the data >> integrity during a crash: Ext3 garantees only filesystem integrity. > That's why postgres keeps its own transaction log. Any of these > filesystems guarantee data integrity for data that's been synced to > disk, and postgres keeps track of what data has and has not been > committed so it can recover gracefully from a failure. Right. I think the optimal setting for a Postgres data directory is journaled metadata, non-journaled file content. Postgres can take care of the data integrity for itself, but it does assume that the filesystem stays structurally sane (eg, data blocks don't get reassigned to the wrong file), so you need a filesystem guarantee about the metadata. WAL files are handled in a much more conservative way (created, filled with zeroes, and fsync'd before we ever put any valuable data in 'em). If you have WAL on its own drive then I think Mike's recommendation of no filesystem journalling at all for that drive is probably OK. Or you can do same as above (journal metadata only) if you want a little extra protection. And of course all this reasoning depends on the assumption that the drive tells the truth about write-completion. If the drive does write caching it had better be able to complete all its accepted writes before dying in a power failure. (Hence, battery-backed write cache is OK, any other kind is evil.) regards, tom lane
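On 7.4/8.0 the usual way to give WAL its own drive or filesystem, as described above, is to relocate pg_xlog and leave a symlink behind. A sketch only, with placeholder paths; the server must be stopped first:

    # stop postgres before doing this; /mnt/wal is a placeholder mount point
    mv /var/lib/pgsql/data/pg_xlog /mnt/wal/pg_xlog
    ln -s /mnt/wal/pg_xlog /var/lib/pgsql/data/pg_xlog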
At 04:11 PM 8/19/2005, Jeremiah Jahn wrote: >On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote: > > Ron wrote: > > > At 01:18 PM 8/19/2005, John A Meinel wrote: > > > > > >> Jeremiah Jahn wrote: > > >> > Sorry about the formatting. > > >> > > > >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote: > > >> > > > >> >>Jeremiah Jahn wrote: > > >> >> > > >> >> > > >> > > >> ... > > >> > > >> >>The expensive parts are the 4915 lookups into the litigant_details > > >> (each > > >> >>one takes approx 4ms for a total of ~20s). > > >> >>And then you do it again on case_data (average 3ms each * 4906 loops = > > >> >>~15s). > > >> > > > >> > Is there some way to avoid this? > > >> > > > >> > > >> Well, in general, 3ms for a single lookup seems really long. Maybe your > > >> index is bloated by not vacuuming often enough. Do you tend to get a lot > > >> of updates to litigant_details? > > > > > > > > > Given that the average access time for a 15Krpm HD is in the 5.5-6ms > > > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single > > > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case) > > > table accesses is requiring a seek. > > > >I think LVM may be a problem, since it also seems to break things up on >the file system. My access time on the seek should be around 1/7th the >15Krpm I believe since it's a 14 disk raid 10 array. And no other >traffic at the moment. Oops. There's a misconception here. RAID arrays increase _throughput_ AKA _bandwidth_ through parallel access to HDs. OTOH, access time is _latency_, and that is not changed. Access time for a RAID set is equal to that of the slowest access time, AKA highest latency, HD in the RAID set. > > Well, from what he has said, the total indexes are < 1GB and he has 6GB > > of ram. So everything should fit. Not to mention he is only accessing > > 5000/several million rows. >I table spaced some of the indexes and they are around 211066880 bytes >for the name_speed index and 149825330 for the lit_actor_speed index >tables seem to be about a gig. Hmm. And you think you are only using 250MB out of your 6GB of RAM? Something doesn't seem to add up here. From what's been posted, I'd expect much more RAM to be in use. > > > This implies a poor match between physical layout and access pattern. > > > > This seems to be the case. But since this is not the only query, it may > > be that other access patterns are more important to optimize for. > > > > > > > > If I understand correctly, the table should not be very fragmented given > > > that this is a reasonably freshly loaded DB? That implies that the > > > fields being looked up are not well sorted in the table compared to the > > > query pattern. > > > > > > If the entire table could fit in RAM, this would be far less of a > > > consideration. Failing that, the physical HD layout has to be improved > > > or the query pattern has to be changed to reduce seeks. > > > > > > > > > > ... > > > > >> After CLUSTER, the current data will stay clustered, but new data will > > >> not, so you have to continually CLUSTER, the same way that you might > > >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as > > >> expensive as a VACUUM FULL. Be aware of this, but it might vastly > > >> improve your performance, so it would be worth it. > > > > > > > > > CLUSTER can be a very large maintenance overhead/problem if the table(s) > > > in question actually need to be "continually" re CLUSTER ed. 
> > > > > > If there is no better solution available, then you do what you have to, > > > but it feels like there should be a better answer here. > > > > > > Perhaps the DB schema needs examining to see if it matches up well with > > > its real usage? > > > > > > Ron Peacetree > > > > > > > I certainly agree that CLUSTER is expensive, and is an on-going > > maintenance issue. If it is the normal access pattern, though, it may be > > worth it. > >The query I've sent you is one of the most common I get just change the >name. I handle about 180K of them a day mostly between 8 and 5. The >clustering has never really been a problem. Like I said before I do it >about once a week. I handle about 3000 update an hour consisting of >about 1000-3000 statement per update. ie about 2.5 million updates per >hour. In the last few months or so I've filtered these down to about >400K update/delete/insert statements per hour. 2.5M updates per hour = ~695 updates per second. 400K per hour = ~112 updates per sec. These should be well within the capabilities of a RAID 10 subsystem based on 14 15Krpm HDs assuming a decent RAID card. What is the exact HW of the RAID subsystem involved and how is it configured? You shouldn't be having a performance problem AFAICT... > > I also wonder, though, if his table is properly normalized. > Which, as you mentioned, might lead to improved access patterns. >The system is about as normalized as I can get it. In general the >layout is the following: >courts have cases, cases have litigant_details. Actors have >identities and litigant_details. Hmmm. Can you tell us more about the actual schema, I may have an idea... > > > > John > > =:-> >-- >Speak softly and carry a +6 two-handed sword. Nah. A wand of 25th level automatic Magic Missile Fire ;-) Ron Peacetree
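The index sizes quoted above can be double-checked from the catalog; a sketch assuming the default 8 KB block size, and relpages is only as current as the last VACUUM or ANALYZE:

    SELECT relname, relpages, relpages * 8192 AS approx_bytes
      FROM pg_class
     WHERE relname IN ('name_speed', 'lit_actor_speed', 'case_speed');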
I'm just watching gnome-system-monoitor. Which after careful consideration.....and looking at dstat means I'm on CRACK....GSM isn't showing cached memory usage....I asume that the cache memory usage is where data off of the disks would be cached...? memory output from dstat is this for a few seconds: ---procs--- ------memory-usage----- ---paging-- --disk/sda----disk/sdb- ----swap--- ----total-cpu-usage---- run blk new|_used _buff _cach _free|__in_ _out_|_read write:_read write|_used _free|usr sys idl wai hiq siq 0 0 0|1336M 10M 4603M 17M| 490B 833B|3823B 3503k:1607k 4285k| 160k 2048M| 4 1 89 7 0 0 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 464k| 160k 2048M| 25 0 75 0 0 0 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 0 75 0 0 0 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 48k: 0 0 | 160k 2048M| 25 0 75 0 0 0 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 0 75 0 0 0 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 132k: 0 0 | 160k 2048M| 25 0 75 0 0 0 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 36k: 0 0 | 160k 2048M| 25 0 75 0 0 0 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 0 75 0 0 0 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 12k: 0 0 | 160k 2048M| 25 0 75 0 0 0 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 0 75 0 0 0 2 0 0|1353M 10M 4585M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 1 75 0 0 0 1 0 0|1321M 10M 4616M 19M| 0 0 | 0 0 : 0 0 | 160k 2048M| 18 8 74 0 0 0 1 0 0|1326M 10M 4614M 17M| 0 0 | 0 0 :4096B 0 | 160k 2048M| 16 10 74 1 0 0 1 0 0|1330M 10M 4609M 17M| 0 0 | 0 12k:4096B 0 | 160k 2048M| 17 9 74 0 0 0 0 1 0|1343M 10M 4596M 17M| 0 0 | 0 0 : 0 316M| 160k 2048M| 5 10 74 11 0 1 0 1 0|1339M 10M 4596M 21M| 0 0 | 0 0 : 0 0 | 160k 2048M| 0 0 74 25 0 1 0 2 0|1334M 10M 4596M 25M| 0 0 | 0 4096B: 0 0 | 160k 2048M| 0 0 54 44 0 1 1 0 0|1326M 10M 4596M 34M| 0 0 | 0 0 : 0 364k| 160k 2048M| 4 1 60 34 0 1 1 0 0|1290M 10M 4596M 70M| 0 0 | 0 12k: 0 0 | 160k 2048M| 24 1 75 0 0 0 1 0 0|1301M 10M 4596M 59M| 0 0 | 0 20k: 0 0 | 160k 2048M| 21 4 75 0 0 0 1 0 0|1312M 10M 4596M 48M| 0 0 | 0 0 : 0 0 | 160k 2048M| 22 4 75 0 0 0 1 0 0|1323M 10M 4596M 37M| 0 0 | 0 0 : 0 24k| 160k 2048M| 21 4 75 0 0 0 1 0 0|1334M 10M 4596M 25M| 0 0 | 0 0 : 0 56k| 160k 2048M| 21 4 75 0 0 0 On Fri, 2005-08-19 at 16:07 -0500, John A Meinel wrote: > Jeremiah Jahn wrote: > > Rebuild in progress with just ext3 on the raid array...will see if this > > helps the access times. If it doesn't I'll mess with the stripe size. I > > have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index > > > >>/dev/null' none of this seems to have helped, or even increased my > > > > memory usage. argh! The only thing about this new system that I'm > > unfamiliar with is the array setup and LVM, which is why I think that's > > where the issue is. clustering and indexing as well as vacuum etc are > > things that I do and have been aware of for sometime. Perhaps slony is a > > factor, but I really don't see it causing problems on index read speed > > esp. when it's not running. > > > > thanx for your help, I really appreciate it. > > -jj- > > > > By the way, how are you measuring memory usage? Can you give the output > of that command, just to make sure you are reading it correctly. > > John > =:-> > -- Speak softly and carry a +6 two-handed sword.
On Sat, 2005-08-20 at 11:59 -0400, Ron wrote: > At 04:11 PM 8/19/2005, Jeremiah Jahn wrote: > >On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote: > > > Ron wrote: > > > > At 01:18 PM 8/19/2005, John A Meinel wrote: > > > > > > > >> Jeremiah Jahn wrote: > > > >> > Sorry about the formatting. > > > >> > > > > >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote: > > > >> > > > > >> >>Jeremiah Jahn wrote: > > > >> >> > > > >> >> > > > >> > > > >> ... > > > >> > > > >> >>The expensive parts are the 4915 lookups into the litigant_details > > > >> (each > > > >> >>one takes approx 4ms for a total of ~20s). > > > >> >>And then you do it again on case_data (average 3ms each * 4906 loops = > > > >> >>~15s). > > > >> > > > > >> > Is there some way to avoid this? > > > >> > > > > >> > > > >> Well, in general, 3ms for a single lookup seems really long. Maybe your > > > >> index is bloated by not vacuuming often enough. Do you tend to get a lot > > > >> of updates to litigant_details? > > > > > > > > > > > > Given that the average access time for a 15Krpm HD is in the 5.5-6ms > > > > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single > > > > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case) > > > > table accesses is requiring a seek. > > > > > >I think LVM may be a problem, since it also seems to break things up on > >the file system. My access time on the seek should be around 1/7th the > >15Krpm I believe since it's a 14 disk raid 10 array. And no other > >traffic at the moment. > > Oops. There's a misconception here. RAID arrays increase > _throughput_ AKA _bandwidth_ through parallel access to HDs. OTOH, > access time is _latency_, and that is not changed. Access time for a > RAID set is equal to that of the slowest access time, AKA highest > latency, HD in the RAID set. so I will max out at the 5.5-6ms rang for access time? > > > > Well, from what he has said, the total indexes are < 1GB and he has 6GB > > > of ram. So everything should fit. Not to mention he is only accessing > > > 5000/several million rows. > >I table spaced some of the indexes and they are around 211066880 bytes > >for the name_speed index and 149825330 for the lit_actor_speed index > >tables seem to be about a gig. > > Hmm. And you think you are only using 250MB out of your 6GB of > RAM? Something doesn't seem to add up here. From what's been > posted, I'd expect much more RAM to be in use. the cached memory usage is complete using up the rest of the memory. > > > > > > This implies a poor match between physical layout and access pattern. > > > > > > This seems to be the case. But since this is not the only query, it may > > > be that other access patterns are more important to optimize for. > > > > > > > > > > > If I understand correctly, the table should not be very fragmented given > > > > that this is a reasonably freshly loaded DB? That implies that the > > > > fields being looked up are not well sorted in the table compared to the > > > > query pattern. > > > > > > > > If the entire table could fit in RAM, this would be far less of a > > > > consideration. Failing that, the physical HD layout has to be improved > > > > or the query pattern has to be changed to reduce seeks. > > > > > > > > > > > > > > ... > > > > > > >> After CLUSTER, the current data will stay clustered, but new data will > > > >> not, so you have to continually CLUSTER, the same way that you might > > > >> VACUUM. 
*However*, IIRC CLUSTER grabs an Exclusive lock, so it is as > > > >> expensive as a VACUUM FULL. Be aware of this, but it might vastly > > > >> improve your performance, so it would be worth it. > > > > > > > > > > > > CLUSTER can be a very large maintenance overhead/problem if the table(s) > > > > in question actually need to be "continually" re CLUSTER ed. > > > > > > > > If there is no better solution available, then you do what you have to, > > > > but it feels like there should be a better answer here. > > > > > > > > Perhaps the DB schema needs examining to see if it matches up well with > > > > its real usage? > > > > > > > > Ron Peacetree > > > > > > > > > > I certainly agree that CLUSTER is expensive, and is an on-going > > > maintenance issue. If it is the normal access pattern, though, it may be > > > worth it. > > > >The query I've sent you is one of the most common I get just change the > >name. I handle about 180K of them a day mostly between 8 and 5. The > >clustering has never really been a problem. Like I said before I do it > >about once a week. I handle about 3000 update an hour consisting of > >about 1000-3000 statement per update. ie about 2.5 million updates per > >hour. In the last few months or so I've filtered these down to about > >400K update/delete/insert statements per hour. > > 2.5M updates per hour = ~695 updates per second. 400K per hour = > ~112 updates per sec. These should be well within the capabilities > of a RAID 10 subsystem based on 14 15Krpm HDs assuming a decent RAID > card. What is the exact HW of the RAID subsystem involved and how is > it configured? You shouldn't be having a performance problem AFAICT... dell perc4 with 14 drives and the each pair is raid 1 with spanning enabled across all of the pairs. It doesn't say raid 10...But it seem to be it. What else would you like to know? > > > > I also wonder, though, if his table is properly normalized. > > Which, as you mentioned, might lead to improved access patterns. > >The system is about as normalized as I can get it. In general the > >layout is the following: > >courts have cases, cases have litigant_details. Actors have > >identities and litigant_details. > > Hmmm. Can you tell us more about the actual schema, I may have an idea... In what format would you like it. What kind of things would you like to know..? I've probably missed a few things, but this is what running on the production box. There are no foreign keys. Cascading delete were far too slow. And having to determine the order of deletes was a pain in the but. 
CREATE TABLE actor ( actor_id character varying(50) NOT NULL, case_id character varying(50) DEFAULT '0'::character varying NOT NULL, court_ori character varying(18) NOT NULL, role_class_code character varying(50) NOT NULL ); CREATE TABLE identity ( identity_id character varying(50) NOT NULL, actor_id character varying(50) NOT NULL, case_id character varying(50) DEFAULT '0'::character varying NOT NULL, court_ori character varying(18) NOT NULL, identity_type character varying(10) NOT NULL, entity_type character varying(50), full_name character varying(60) NOT NULL, entity_acronym character varying(50), name_prefix character varying(50), first_name character varying(50), middle_name character varying(50), last_name character varying(50), name_suffix character varying(50), gender_code character varying(50), date_of_birth date, place_of_birth character varying(50), height character varying(50), height_unit character varying(50), weight character varying(50), weight_unit character varying(50), religion character varying(50), ethnicity character varying(50), citizenship_country character varying(50), hair_color character varying(50), eye_color character varying(50), scars_marks_tatto character varying(255), marital_status character varying(50) ); ALTER TABLE ONLY identity ALTER COLUMN full_name SET STATISTICS 1000; CREATE TABLE case_data ( case_id character varying(50) NOT NULL, court_ori character varying(18) NOT NULL, type_code character varying(50), subtype_code character varying(50), case_category character varying(50), case_title character varying(100), type_subtype_text character varying(255), case_year integer, extraction_datetime character varying(15) NOT NULL, update_date date NOT NULL, case_dom oid, data bytea ); CREATE TABLE litigant_details ( actor_id character varying(50) NOT NULL, case_id character varying(50) NOT NULL, court_ori character varying(18) NOT NULL, assigned_case_role character varying(50) NOT NULL, initial_file_date date, initial_close_date date, reopen_date date, reclose_date date, physical_file_location character varying(50), impound_litigant_data character varying(50), impound_litigant_minutes character varying(50), actor_type character varying(50) NOT NULL, conviction character varying(3) ); CREATE TABLE actor_identifier ( identity_id character varying(50) NOT NULL, actor_id character varying(50) NOT NULL, case_id character varying(50) DEFAULT '0'::character varying NOT NULL, court_ori character varying(18) NOT NULL, actor_identifier_type_code character varying(50) NOT NULL, actor_identifier_id character varying(50) NOT NULL ); CREATE TABLE actor_relationship ( litigant_actor_id character varying(50) NOT NULL, related_actor_id character varying(50) NOT NULL, case_id character varying(50) NOT NULL, court_ori character varying(18) NOT NULL, relationship_type character varying(50) NOT NULL ); CREATE INDEX lit_actor_speed ON litigant_details USING btree (actor_id); CREATE INDEX name_speed ON identity USING btree (full_name); ALTER TABLE identity CLUSTER ON name_speed; CREATE INDEX case_speed ON case_data USING btree (court_ori, case_id); ALTER TABLE case_data CLUSTER ON case_speed; ALTER TABLE ONLY actor ADD CONSTRAINT actor_pkey PRIMARY KEY (court_ori, case_id, actor_id); ALTER TABLE ONLY identity ADD CONSTRAINT identity_pkey PRIMARY KEY (court_ori, case_id, identity_id, actor_id); ALTER TABLE ONLY case_data ADD CONSTRAINT case_data_pkey PRIMARY KEY (court_ori, case_id); ALTER TABLE ONLY litigant_details ADD CONSTRAINT litigant_details_pkey PRIMARY KEY (actor_id, 
case_id, court_ori); > > > > > > > John > > > =:-> > >-- > >Speak softly and carry a +6 two-handed sword. > > Nah. A wand of 25th level automatic Magic Missile Fire ;-) > > Ron Peacetree > -- Speak softly and carry a +6 two-handed sword.
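Given that the hot query filters on court_ori plus a LIKE prefix on full_name, one possible experiment (not something proposed in the thread; the index name is made up) is a composite index matching that access pattern, which the C locale lets the planner use for the prefix range:

    CREATE INDEX identity_court_name ON identity (court_ori, full_name);
    ANALYZE identity;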
On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote: > Jeremiah Jahn wrote: > > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote: > > > >>Jeremiah Jahn wrote: > >> > > > ... > > >> > >>Well, in general, 3ms for a single lookup seems really long. Maybe your > >>index is bloated by not vacuuming often enough. Do you tend to get a lot > >>of updates to litigant_details? > > > > I have vacuumed this already. I get lots of updates, but this data is > > mostly unchanging. > > > > > >>There are a couple possibilities at this point. First, you can REINDEX > >>the appropriate index, and see if that helps. However, if this is a test > >>box, it sounds like you just did a dump and reload, which wouldn't have > >>bloat in an index. > > > > > > I loaded it using slony > > I don't know that slony versus pg_dump/pg_restore really matters. The > big thing is that Updates wouldn't be trashing your index. > But if you are saying that you cluster once/wk your index can't be that > messed up anyway. (Unless CLUSTER messes up the non-clustered indexes, > but that would make cluster much less useful, so I would have guessed > this was not the case) > > > > > > >>Another possibility. Is this the column that you usually use when > >>pulling information out of litigant_details? If so, you can CLUSTER > >>litigant_details on the appropriate index. This will help things be > >>close together that should be, which decreases the index lookup costs. > > > > clustering on this right now. Most of the other things are already > > clustered. name and case_data > > Just as a reality check, they are clustered on the columns in question, > right? (I don't know if this column is a primary key or not, but any > index can be used for clustering). > > > > > > >>However, if this is not the common column, then you probably will slow > >>down whatever other accesses you may have on this table. > >> > >>After CLUSTER, the current data will stay clustered, but new data will > >>not, so you have to continually CLUSTER, the same way that you might > >>VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as > >>expensive as a VACUUM FULL. Be aware of this, but it might vastly > >>improve your performance, so it would be worth it. > > > > I generally re-cluster once a week. > > > >>> > >>>>So there is no need for preloading your indexes on the identity table. > >>>>It is definitely not the bottleneck. > >>>> > >>>>So a few design bits, which may help your database. > >>>>Why is "actor_id" a text field instead of a number? > >>> > >>>This is simply due to the nature of the data. > >>> > >> > >>I'm just wondering if changing into a number, and using a number->name > >>lookup would be faster for you. It may not be. In general, I prefer to > >>use numbers for references. I may be over paranoid, but I know that some > >>locales are bad with string -> string comparisons. And since the data in > >>your database is stored as UNICODE, I'm not sure if it has to do any > >>translating or not. Again, something to consider, it may not make any > >>difference. > > > > I don't believe so. I initialze the DB as 'lang=C'. I used to have the > > problem where things were being inited as en_US. this would prevent any > > text based index from working. This doesn't seem to be the case here, so > > I'm not worried about it. > > > > Sorry, I think I was confusing you with someone else who posted SHOW ALL. > > > > > > >> > > ... > > > it's cached alright. I'm getting a read rate of about 150MB/sec. 
I would > > have thought is would be faster with my raid setup. I think I'm going to > > scrap the whole thing and get rid of LVM. I'll just do a straight ext3 > > system. Maybe that will help. Still trying to get suggestions for a > > stripe size. > > > > I don't think 150MB/s is out of the realm for a 14 drive array. > How fast is > time dd if=/dev/zero of=testfile bs=8192 count=1000000 time dd if=/dev/zero of=testfile bs=8192 count=1000000 1000000+0 records in 1000000+0 records out real 1m24.248s user 0m0.381s sys 0m33.028s > (That should create a 8GB file, which is too big to cache everything) > And then how fast is: > time dd if=testfile of=/dev/null bs=8192 count=1000000 time dd if=testfile of=/dev/null bs=8192 count=1000000 1000000+0 records in 1000000+0 records out real 0m54.139s user 0m0.326s sys 0m8.916s and on a second run: real 0m55.667s user 0m0.341s sys 0m9.013s > > That should give you a semi-decent way of measuring how fast the RAID > system is, since it should be too big to cache in ram. about 150MB/Sec. Is there no better way to make this go faster...? > > > > >>I can point you to REINDEX and CLUSTER, but if it is caching in ram, I > >>honestly can't say why the per loop would be that much slower. > >>Are both systems running the same postgres version? It sounds like it is > >>different (since you say something about switching to 8.0). > > > > These had little or no effect. > > The production machine is running 7.4 while the devel machine is running > > 8.0 > > > > Well, my concern is that maybe some portion of the 8.0 code actually > slowed things down for you. You could try reverting to 7.4 on the devel > box, though I think playing with upgrading to 8.1 might be more worthwhile. And the level of stability for 8.1? I started with 7.4 and it didn't really feel as fast as it should either. > > ... > > > > > this is a cached version. > > > > I assume that you mean this is the second run of the query. I can't > compare it too much, since this is "smith" rather than "jones". But this > one is 17s rather than the other one being 46s. > > And that includes having 8k rows instead of having 5k rows. > > Have you tried other values with disabled nested loops? Because this > query (at least in cached form) seems to be *way* faster than with > nested loops. > I know that you somehow managed to get 200s in your testing, but it > might just be that whatever needed to be loaded is now loaded, and you > would get better performance. > If this is true, it means you might need to tweak some settings, and > make sure your statistics are decent, so that postgres can actually pick > the optimal plan. 
> > > > >>copa=> explain analyze select full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data > >>copa-> from identity > >>copa-> join litigant_details on identity.actor_id = litigant_details.actor_id > >>copa-> join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori = case_data.court_ori > >>copa-> join court on identity.court_ori = court.id > >>copa-> where identity.court_ori = 'IL081025J' and full_name like 'SMITH%' order by full_name; > >> QUERY PLAN > >>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > >> Sort (cost=100502560.72..100502583.47 rows=9099 width=86) (actual time=17843.876..17849.401 rows=8094 loops=1) > >> Sort Key: identity.full_name > >> -> Merge Join (cost=100311378.72..100501962.40 rows=9099 width=86) (actual time=15195.816..17817.847 rows=8094 loops=1) > >> Merge Cond: ((("outer".court_ori)::text = "inner"."?column10?") AND (("outer".case_id)::text = "inner"."?column11?")) > >> -> Index Scan using case_speed on case_data (cost=0.00..170424.73 rows=3999943 width=26) (actual time=0.015..4540.525rows=3018284 loops=1) > >> -> Sort (cost=100311378.72..100311400.82 rows=8839 width=112) (actual time=9594.985..9601.174 rows=8094 loops=1) > >> Sort Key: (litigant_details.court_ori)::text, (litigant_details.case_id)::text > >> -> Nested Loop (cost=100002491.43..100310799.34 rows=8839 width=112) (actual time=6892.755..9555.828rows=8094 loops=1) > >> -> Seq Scan on court (cost=0.00..3.29 rows=1 width=12) (actual time=0.085..0.096 rows=1 loops=1) > >> Filter: ('IL081025J'::text = (id)::text) > > What I don't really understand is the next part. It seems to be doing an > index scan on 3.7M rows, and getting very decent performance (5s), and > then merging against a table which returns only 8k rows. > Why is it having to look through all of those rows? > I may be missing something, but this says it is able to do 600 index > lookups / millisecond. Which seems superfast. (Compared to your earlier > 4ms / lookup) > Makes me a little confused myself... > Something fishy is going on here. > > > >> -> Merge Join (cost=2491.43..310707.66 rows=8839 width=113) (actual time=6892.656..9519.680 rows=8094loops=1) > >> Merge Cond: (("outer".actor_id)::text = "inner"."?column7?") > >> -> Index Scan using lit_actor_speed on litigant_details (cost=0.00..295722.00 rows=4956820width=81) (actual time=0.027..5613.814 rows=3736703 loops=1) > >> -> Sort (cost=2491.43..2513.71 rows=8913 width=82) (actual time=116.071..122.272 rows=8100loops=1) > >> Sort Key: (identity.actor_id)::text > >> -> Index Scan using name_speed on identity (cost=0.00..1906.66 rows=8913 width=82)(actual time=0.133..81.104 rows=8100 loops=1) > >> Index Cond: (((full_name)::text >= 'SMITH'::character varying) AND ((full_name)::text< 'SMITI'::character varying)) > >> Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'SMITH%'::text)) > >> Total runtime: 17859.917 ms > > > > > >>But really, you have worse index speed, and that needs to be figured out. > >> > >>John > >>=:-> > > I'm assuming your data is private (since it looks like legal stuff). > Unless maybe that makes it part of the public record. > Anyway, I'm not able to, but sometimes someone like Tom can profile > stuff to see what is going on. 
I've had Tom on here before :) Not my devel box, but my production box, a couple of years ago. > > I might just be messing up my ability to read the explain output. But > somehow things don't seem to be lining up with the cost of a single > index lookup. > On my crappy Celeron 450 box, an index lookup is 0.06ms once things are > cached in ram. > > John > =:-> > > -- Speak softly and carry a +6 two-handed sword.
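To separate the raw array's throughput from any LVM or filesystem overhead, the block device can be read directly; the device name below is a placeholder, and the test only reads, so it is non-destructive:

    # reads 8GB straight off the array, bypassing LVM and ext3
    time dd if=/dev/sdb of=/dev/null bs=1M count=8192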
At 02:53 PM 8/20/2005, Jeremiah Jahn wrote: >On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote: > > Jeremiah Jahn wrote: > > > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote: > > > ><snip> > > > > > it's cached alright. I'm getting a read rate of about 150MB/sec. I would > > > have thought is would be faster with my raid setup. I think I'm going to > > > scrap the whole thing and get rid of LVM. I'll just do a straight ext3 > > > system. Maybe that will help. Still trying to get suggestions for a > > > stripe size. > > > > > > > I don't think 150MB/s is out of the realm for a 14 drive array. > > How fast is time dd if=/dev/zero of=testfile bs=8192 count=1000000 > > >time dd if=/dev/zero of=testfile bs=8192 count=1000000 >1000000+0 records in >1000000+0 records out > >real 1m24.248s >user 0m0.381s >sys 0m33.028s > > > > (That should create a 8GB file, which is too big to cache everything) > > And then how fast is: > > time dd if=testfile of=/dev/null bs=8192 count=1000000 > >time dd if=testfile of=/dev/null bs=8192 count=1000000 >1000000+0 records in >1000000+0 records out > >real 0m54.139s >user 0m0.326s >sys 0m8.916s > > >and on a second run: > >real 0m55.667s >user 0m0.341s >sys 0m9.013s > > > > > > That should give you a semi-decent way of measuring how fast the RAID > > system is, since it should be too big to cache in ram. > >about 150MB/Sec. Is there no better way to make this go faster...? Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them doing raw sequential IO like this should be capable of at ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's, ~7*79MB/s= 553MB/s if using Fujitsu MAU's, and ~7*86MB/s= 602MB/s if using Maxtor Atlas 15K II's to devices external to the RAID array. _IF_ the controller setup is high powered enough to keep that kind of IO rate up. This will require a controller or controllers providing dual channel U320 bandwidth externally and quad channel U320 bandwidth internally. IOW, it needs a controller or controllers talking 64b 133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized IO buffer as well. AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic MegaRAID controllers. What I don't know is which exact one yours is, nor do I know if it (or any of the MegaRAID controllers) are high powered enough. Talk to your HW supplier to make sure you have controllers adequate to your HD's. ...and yes, your average access time will be in the 5.5ms - 6ms range when doing a physical seek. Even with RAID, you want to minimize seeks and maximize sequential IO when accessing them. Best to not go to HD at all ;-) Hope this helps, Ron Peacetree
At 02:16 PM 8/20/2005, Jeremiah Jahn wrote: >I'm just watching gnome-system-monoitor. Which after careful >consideration.....and looking at dstat means I'm on CRACK....GSM isn't >showing cached memory usage....I asume that the cache memory usage >is where data off of the disks would be cached...? > >memory output from dstat is this for a few seconds: > >---procs--- ------memory-usage----- ---paging-- >--disk/sda----disk/sdb- ----swap--- ----total-cpu-usage---- >run blk new|_used _buff _cach _free|__in_ _out_|_read write:_read >write|_used _free|usr sys idl wai hiq siq > 0 0 0|1336M 10M 4603M 17M| 490B 833B|3823B 3503k:1607k > 4285k| 160k 2048M| 4 1 89 7 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 > : 0 464k| 160k 2048M| 25 0 75 0 0 0 ><snip> > 1 0 0|1334M 10M 4596M 25M| 0 0 | 0 0 > : 0 56k| 160k 2048M| 21 4 75 0 0 0 Then the "low memory usage" was a chimera. Excellent! Given the evidence in this thread, IMO you should upgrade your box to 16GB of RAM ASAP. That should be enough to cache most, if not all, of the 10GB of the "hot" part of your DB; thereby dedicating your HD subsystem as much as possible to writes (which is unavoidable HD IO). As I've posted before, at $75-$150/GB, it's well worth the investment whenever you can prove it will help as we have here. Hope this helps, Ron Peacetree
Jeremiah Jahn wrote: > I'm just watching gnome-system-monoitor. Which after careful > consideration.....and looking at dstat means I'm on CRACK....GSM isn't > showing cached memory usage....I asume that the cache memory usage is > where data off of the disks would be cached...? > Well a simple "free" also tells you how much has been cached. I believe by reading the _cach line, it looks like you have 4.6G cached. So you are indeed using memory. I'm still concerned why it seems to be taking 3-4ms per index lookup, when things should already be cached in RAM. Now, I may be wrong about whether the indexes are cached, but I sure would expect them to be. What is the time for a cached query on your system (with normal nested loops)? (give the EXPLAIN ANALYZE for the *second* run, or maybe the fourth). I'm glad that we aren't seeing something weird with your kernel, at least. John =:-> > > > memory output from dstat is this for a few seconds: > > ---procs--- ------memory-usage----- ---paging-- --disk/sda----disk/sdb- ----swap--- ----total-cpu-usage---- > run blk new|_used _buff _cach _free|__in_ _out_|_read write:_read write|_used _free|usr sys idl wai hiq siq > 0 0 0|1336M 10M 4603M 17M| 490B 833B|3823B 3503k:1607k 4285k| 160k 2048M| 4 1 89 7 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 464k| 160k 2048M| 25 0 75 0 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 0 75 0 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 48k: 0 0 | 160k 2048M| 25 0 75 0 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 0 75 0 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 132k: 0 0 | 160k 2048M| 25 0 75 0 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 36k: 0 0 | 160k 2048M| 25 0 75 0 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 0 75 0 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 12k: 0 0 | 160k 2048M| 25 0 75 0 0 0 > 1 0 0|1337M 10M 4600M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 0 75 0 0 0 > 2 0 0|1353M 10M 4585M 18M| 0 0 | 0 0 : 0 0 | 160k 2048M| 25 1 75 0 0 0 > 1 0 0|1321M 10M 4616M 19M| 0 0 | 0 0 : 0 0 | 160k 2048M| 18 8 74 0 0 0 > 1 0 0|1326M 10M 4614M 17M| 0 0 | 0 0 :4096B 0 | 160k 2048M| 16 10 74 1 0 0 > 1 0 0|1330M 10M 4609M 17M| 0 0 | 0 12k:4096B 0 | 160k 2048M| 17 9 74 0 0 0 > 0 1 0|1343M 10M 4596M 17M| 0 0 | 0 0 : 0 316M| 160k 2048M| 5 10 74 11 0 1 > 0 1 0|1339M 10M 4596M 21M| 0 0 | 0 0 : 0 0 | 160k 2048M| 0 0 74 25 0 1 > 0 2 0|1334M 10M 4596M 25M| 0 0 | 0 4096B: 0 0 | 160k 2048M| 0 0 54 44 0 1 > 1 0 0|1326M 10M 4596M 34M| 0 0 | 0 0 : 0 364k| 160k 2048M| 4 1 60 34 0 1 > 1 0 0|1290M 10M 4596M 70M| 0 0 | 0 12k: 0 0 | 160k 2048M| 24 1 75 0 0 0 > 1 0 0|1301M 10M 4596M 59M| 0 0 | 0 20k: 0 0 | 160k 2048M| 21 4 75 0 0 0 > 1 0 0|1312M 10M 4596M 48M| 0 0 | 0 0 : 0 0 | 160k 2048M| 22 4 75 0 0 0 > 1 0 0|1323M 10M 4596M 37M| 0 0 | 0 0 : 0 24k| 160k 2048M| 21 4 75 0 0 0 > 1 0 0|1334M 10M 4596M 25M| 0 0 | 0 0 : 0 56k| 160k 2048M| 21 4 75 0 0 0 > > > > On Fri, 2005-08-19 at 16:07 -0500, John A Meinel wrote: > >>Jeremiah Jahn wrote: >> >>>Rebuild in progress with just ext3 on the raid array...will see if this >>>helps the access times. If it doesn't I'll mess with the stripe size. I >>>have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index >>> >>> >>>>/dev/null' none of this seems to have helped, or even increased my >>> >>>memory usage. argh! The only thing about this new system that I'm >>>unfamiliar with is the array setup and LVM, which is why I think that's >>>where the issue is. 
clustering and indexing as well as vacuum etc are >>>things that I do and have been aware of for sometime. Perhaps slony is a >>>factor, but I really don't see it causing problems on index read speed >>>esp. when it's not running. >>> >>>thanx for your help, I really appreciate it. >>>-jj- >>> >> >>By the way, how are you measuring memory usage? Can you give the output >>of that command, just to make sure you are reading it correctly. >> >>John >>=:-> >>
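The warm-cache timing John is asking for can be produced by running the same EXPLAIN ANALYZE several times and watching the Total runtime line; the first run pays for the disk reads, the later runs should come out of cache. The query below is a made-up stand-in for the real problem query:

    for i in 1 2 3 4; do
      psql mydb -c "EXPLAIN ANALYZE SELECT * FROM some_big_table WHERE some_indexed_col = 12345;" | grep 'Total runtime'
    done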
Ron wrote: > At 02:53 PM 8/20/2005, Jeremiah Jahn wrote: > >> On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote: >> > Jeremiah Jahn wrote: >> > > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote: >> > > >> <snip> >> > >> > > it's cached alright. I'm getting a read rate of about 150MB/sec. I >> would >> > > have thought is would be faster with my raid setup. I think I'm >> going to >> > > scrap the whole thing and get rid of LVM. I'll just do a straight >> ext3 >> > > system. Maybe that will help. Still trying to get suggestions for a >> > > stripe size. >> > > Well, since you can get a read of the RAID at 150MB/s, that means that it is actual I/O speed. It may not be cached in RAM. Perhaps you could try the same test, only using say 1G, which should be cached. >> > >> > I don't think 150MB/s is out of the realm for a 14 drive array. >> > How fast is time dd if=/dev/zero of=testfile bs=8192 count=1000000 >> > >> time dd if=/dev/zero of=testfile bs=8192 count=1000000 >> 1000000+0 records in >> 1000000+0 records out >> >> real 1m24.248s >> user 0m0.381s >> sys 0m33.028s >> >> >> > (That should create a 8GB file, which is too big to cache everything) >> > And then how fast is: >> > time dd if=testfile of=/dev/null bs=8192 count=1000000 >> >> time dd if=testfile of=/dev/null bs=8192 count=1000000 >> 1000000+0 records in >> 1000000+0 records out >> >> real 0m54.139s >> user 0m0.326s >> sys 0m8.916s >> >> >> and on a second run: >> >> real 0m55.667s >> user 0m0.341s >> sys 0m9.013s >> >> >> > >> > That should give you a semi-decent way of measuring how fast the RAID >> > system is, since it should be too big to cache in ram. >> >> about 150MB/Sec. Is there no better way to make this go faster...? I'm actually curious about PCI bus saturation at this point. Old 32-bit 33MHz pci could only push 1Gbit = 100MB/s. Now, I'm guessing that this is a higher performance system. But I'm really surprised that your write speed is that close to your read speed. (100MB/s write, 150MB/s read). > > Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them > doing raw sequential IO like this should be capable of at > ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's, ~7*79MB/s= 553MB/s > if using Fujitsu MAU's, and ~7*86MB/s= 602MB/s if using Maxtor Atlas 15K > II's to devices external to the RAID array. I know I thought these were SATA drives, over 2 controllers. I could be completely wrong, though. > > _IF_ the controller setup is high powered enough to keep that kind of IO > rate up. This will require a controller or controllers providing dual > channel U320 bandwidth externally and quad channel U320 bandwidth > internally. IOW, it needs a controller or controllers talking 64b > 133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized > IO buffer as well. > > AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic > MegaRAID controllers. What I don't know is which exact one yours is, > nor do I know if it (or any of the MegaRAID controllers) are high > powered enough. > > Talk to your HW supplier to make sure you have controllers adequate to > your HD's. > > ...and yes, your average access time will be in the 5.5ms - 6ms range > when doing a physical seek. > Even with RAID, you want to minimize seeks and maximize sequential IO > when accessing them. > Best to not go to HD at all ;-) Well, certainly, if you can get more into RAM, you're always better off. For writing, a battery-backed write cache, and for reading lots of system RAM. 
> > Hope this helps, > Ron Peacetree > John =:->
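On the PCI-saturation question, lspci will at least confirm what kind of slot and link the RAID controller actually got. A sketch; the bus address in the second command is made up and should be taken from the first command's output:

    lspci | grep -i raid          # find the controller and note its bus address
    lspci -v -s 02:0e.0           # hypothetical address; shows the slot's capabilities

A 64-bit/133MHz PCI-X or PCI Express x8 slot has roughly 1-2GB/s of headroom, so if the card reports a proper link the bus itself is unlikely to explain a 150MB/s ceiling.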
I'm sorry that I wrote that the option would put PostgreSQL's transaction-log persistency at risk. I should have asked instead how you have taken this into account. Tom Lane's email below convinces me that you have taken metadata-only journalling into account and still preserve the persistency of committed transactions. This means that ext3 with data=writeback is safe with PostgreSQL even against a hardware reset button. Metadata-only journalling is faster when it can be used; I didn't know that any database could keep its guarantees with that option. I looked at your problem. One issue is that you need to keep certain data cached in memory all the time. That could be solved by running SELECT COUNT(*) FROM to_be_cached; as a cron job, which loads the whole table into the Linux kernel's page cache. Marko Ristola Tom Lane wrote: >Right. I think the optimal setting for a Postgres data directory is >journaled metadata, non-journaled file content. Postgres can take care >of the data integrity for itself, but it does assume that the filesystem >stays structurally sane (eg, data blocks don't get reassigned to the >wrong file), so you need a filesystem guarantee about the metadata. > >WAL files are handled in a much more conservative way (created, filled >with zeroes, and fsync'd before we ever put any valuable data in 'em). >If you have WAL on its own drive then I think Mike's recommendation of >no filesystem journalling at all for that drive is probably OK. Or >you can do same as above (journal metadata only) if you want a little >extra protection. > >And of course all this reasoning depends on the assumption that the >drive tells the truth about write-completion. If the drive does write >caching it had better be able to complete all its accepted writes before >dying in a power failure. (Hence, battery-backed write cache is OK, any >other kind is evil.) > > regards, tom lane
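Marko's cache-warming suggestion, written out as a cron entry (the database name is a placeholder; to_be_cached is the table name from his message), might look like this; note that a sequential count(*) pulls the table heap into the kernel cache but not its indexes, which would need their own warm-up query:

    # crontab entry: re-warm the OS page cache every 30 minutes
    */30 * * * * psql mydb -c "SELECT count(*) FROM to_be_cached;" >/dev/null 2>&1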
On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote: > Ron wrote: > > At 02:53 PM 8/20/2005, Jeremiah Jahn wrote: > > > >> On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote: > >> > Jeremiah Jahn wrote: > >> > > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote: > >> > > > >> <snip> > >> > > >> > > it's cached alright. I'm getting a read rate of about 150MB/sec. I > >> would > >> > > have thought is would be faster with my raid setup. I think I'm > >> going to > >> > > scrap the whole thing and get rid of LVM. I'll just do a straight > >> ext3 > >> > > system. Maybe that will help. Still trying to get suggestions for a > >> > > stripe size. > >> > > > > Well, since you can get a read of the RAID at 150MB/s, that means that > it is actual I/O speed. It may not be cached in RAM. Perhaps you could > try the same test, only using say 1G, which should be cached. [root@io pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000 1000000+0 records in 1000000+0 records out real 0m8.885s user 0m0.299s sys 0m6.998s [root@io pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000 1000000+0 records in 1000000+0 records out real 0m1.654s user 0m0.232s sys 0m1.415s > > >> > > >> > I don't think 150MB/s is out of the realm for a 14 drive array. > >> > How fast is time dd if=/dev/zero of=testfile bs=8192 count=1000000 > >> > > >> time dd if=/dev/zero of=testfile bs=8192 count=1000000 > >> 1000000+0 records in > >> 1000000+0 records out > >> > >> real 1m24.248s > >> user 0m0.381s > >> sys 0m33.028s > >> > >> > >> > (That should create a 8GB file, which is too big to cache everything) > >> > And then how fast is: > >> > time dd if=testfile of=/dev/null bs=8192 count=1000000 > >> > >> time dd if=testfile of=/dev/null bs=8192 count=1000000 > >> 1000000+0 records in > >> 1000000+0 records out > >> > >> real 0m54.139s > >> user 0m0.326s > >> sys 0m8.916s > >> > >> > >> and on a second run: > >> > >> real 0m55.667s > >> user 0m0.341s > >> sys 0m9.013s > >> > >> > >> > > >> > That should give you a semi-decent way of measuring how fast the RAID > >> > system is, since it should be too big to cache in ram. > >> > >> about 150MB/Sec. Is there no better way to make this go faster...? > > I'm actually curious about PCI bus saturation at this point. Old 32-bit > 33MHz pci could only push 1Gbit = 100MB/s. Now, I'm guessing that this > is a higher performance system. But I'm really surprised that your write > speed is that close to your read speed. (100MB/s write, 150MB/s read). The raid array I have is currently set up to use a single channel. But I have dual controllers In the array. And dual external slots on the card. The machine is brand new and has pci-e backplane. > > > > > Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them > > doing raw sequential IO like this should be capable of at > > ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's, ~7*79MB/s= 553MB/s BTW I'm using Seagate Cheetah 15K.4's > > if using Fujitsu MAU's, and ~7*86MB/s= 602MB/s if using Maxtor Atlas 15K > > II's to devices external to the RAID array. > > I know I thought these were SATA drives, over 2 controllers. I could be > completely wrong, though. > > > > > _IF_ the controller setup is high powered enough to keep that kind of IO > > rate up. This will require a controller or controllers providing dual > > channel U320 bandwidth externally and quad channel U320 bandwidth > > internally. 
IOW, it needs a controller or controllers talking 64b > > 133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized > > IO buffer as well. > > > > AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic > > MegaRAID controllers. What I don't know is which exact one yours is, > > nor do I know if it (or any of the MegaRAID controllers) are high > > powered enough. PERC4eDC-PCI Express, 128MB Cache, 2-External Channels > > > > Talk to your HW supplier to make sure you have controllers adequate to > > your HD's. > > > > ...and yes, your average access time will be in the 5.5ms - 6ms range > > when doing a physical seek. > > Even with RAID, you want to minimize seeks and maximize sequential IO > > when accessing them. > > Best to not go to HD at all ;-) > > Well, certainly, if you can get more into RAM, you're always better off. > For writing, a battery-backed write cache, and for reading lots of > system RAM. I'm not really worried about the writing, it's the reading that needs to be faster. > > > > > Hope this helps, > > Ron Peacetree > > > > John > =:-> -- Speak softly and carry a +6 two-handed sword.
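One caveat on the bs=1024 runs above: 1KB blocks mean a million read()/write() system calls, so a good chunk of what is being timed is syscall overhead rather than the array (note the ~7 seconds of system time on the 8.885s write). Rerunning with larger blocks, roughly like this, gives a cleaner picture of the hardware:

    # the same ~1GB of data, but moved in 1MB chunks
    time dd if=/dev/zero of=testfile bs=1M count=1024
    time dd if=testfile of=/dev/null bs=1M count=1024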
Jeremiah Jahn wrote: > On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote: > >>Ron wrote: >> >>>At 02:53 PM 8/20/2005, Jeremiah Jahn wrote: >> >>Well, since you can get a read of the RAID at 150MB/s, that means that >>it is actual I/O speed. It may not be cached in RAM. Perhaps you could >>try the same test, only using say 1G, which should be cached. > > > [root@io pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000 > 1000000+0 records in > 1000000+0 records out > > real 0m8.885s > user 0m0.299s > sys 0m6.998s > [root@io pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000 > 1000000+0 records in > 1000000+0 records out > > real 0m1.654s > user 0m0.232s > sys 0m1.415s > The write time seems about the same (but you only have 128MB of write cache), but your read jumped up to 620MB/s. So you drives do seem to be giving you 150MB/s. > ... >>I'm actually curious about PCI bus saturation at this point. Old 32-bit >>33MHz pci could only push 1Gbit = 100MB/s. Now, I'm guessing that this >>is a higher performance system. But I'm really surprised that your write >>speed is that close to your read speed. (100MB/s write, 150MB/s read). > > > The raid array I have is currently set up to use a single channel. But I > have dual controllers In the array. And dual external slots on the card. > The machine is brand new and has pci-e backplane. > > > > >>>Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them >>>doing raw sequential IO like this should be capable of at >>> ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's, ~7*79MB/s= 553MB/s > > BTW I'm using Seagate Cheetah 15K.4's > Now, are the numbers that Ron is quoting in megabytes or megabits? I'm guessing he knows what he is talking about, and is doing megabytes. 80MB/s sustained seems rather high for a hard-disk. Though this page: http://www.storagereview.com/articles/200411/20041116ST3146754LW_2.html Does seem to agree with that statement. (Between 56 and 93MB/s) And since U320 is a 320MB/s bus, it doesn't seem like anything there should be saturating. So why the low performance???? >> >>>_IF_ the controller setup is high powered enough to keep that kind of IO >>>rate up. This will require a controller or controllers providing dual >>>channel U320 bandwidth externally and quad channel U320 bandwidth >>>internally. IOW, it needs a controller or controllers talking 64b >>>133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized >>>IO buffer as well. >>> >>>AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic >>>MegaRAID controllers. What I don't know is which exact one yours is, >>>nor do I know if it (or any of the MegaRAID controllers) are high >>>powered enough. > > > PERC4eDC-PCI Express, 128MB Cache, 2-External Channels Do you know which card it is? Does it look like this one: http://www.lsilogic.com/products/megaraid/megaraid_320_2e.html Judging by the 320 speed, and 2 external controllers, that is my guess. They at least claim a theoretical max of 2GB/s. Which makes you wonder why reading from RAM is only able to get throughput of 600MB/s. Did you run it multiple times? On my windows system, I get just under 550MB/s for what should be cached, copying from /dev/zero to /dev/null I get 2.4GB/s (though that might be a no-op). On a similar linux machine, I'm able to get 1200MB/s for a cached file. (And 3GB/s for a zero=>null copy). John =:-> > > >>>Talk to your HW supplier to make sure you have controllers adequate to >>>your HD's. 
>>> >>>...and yes, your average access time will be in the 5.5ms - 6ms range >>>when doing a physical seek. >>>Even with RAID, you want to minimize seeks and maximize sequential IO >>>when accessing them. >>>Best to not go to HD at all ;-) >> >>Well, certainly, if you can get more into RAM, you're always better off. >>For writing, a battery-backed write cache, and for reading lots of >>system RAM. > > > I'm not really worried about the writing, it's the reading the reading > that needs to be faster. > > >>>Hope this helps, >>>Ron Peacetree >>> >> >>John >>=:->
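To answer the "did you run it multiple times" question systematically, something like the following shows both how repeatable the cached read is and what the machine's pure memory-copy ceiling looks like for comparison:

    for i in 1 2 3; do time dd if=testfile of=/dev/null bs=1M count=1024; done
    # no disk or page-cache file involved at all: an upper bound set by CPU/memory
    time dd if=/dev/zero of=/dev/null bs=1M count=4096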
I'm resending this as it appears not to have made it to the list. At 10:54 AM 8/21/2005, Jeremiah Jahn wrote: >On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote: > > Ron wrote: > > > > Well, since you can get a read of the RAID at 150MB/s, that means that > > it is actual I/O speed. It may not be cached in RAM. Perhaps you could > > try the same test, only using say 1G, which should be cached. > >[root@io pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000 >1000000+0 records in >1000000+0 records out > >real 0m8.885s >user 0m0.299s >sys 0m6.998s This is abysmally slow. >[root@io pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000 >1000000+0 records in >1000000+0 records out > >real 0m1.654s >user 0m0.232s >sys 0m1.415s This transfer rate is the only one out of the 4 you have posted that is in the vicinity of where it should be. >The raid array I have is currently set up to use a single channel. But I >have dual controllers in the array. And dual external slots on the card. >The machine is brand new and has pci-e backplane. > So you have 2 controllers each with 2 external slots? But you are currently only using 1 controller and only one external slot on that controller? > > > Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them > > > doing raw sequential IO like this should be capable of at > > > ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's >BTW I'm using Seagate Cheetah 15K.4's OK, now we have that nailed down. > > > AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic > > > MegaRAID controllers. What I don't know is which exact one yours is, > > > nor do I know if it (or any of the MegaRAID controllers) are high > > > powered enough. > >PERC4eDC-PCI Express, 128MB Cache, 2-External Channels Looks like they are using the LSI Logic MegaRAID SCSI 320-2E controller. IIUC, you have 2 of these, each with 2 external channels? The specs on these appear a bit strange. They are listed as being a PCI-Ex8 card, which means they should have a max bandwidth of 20Gb/s= 2GB/s, yet they are also listed as only supporting dual channel U320= 640MB/s when they could easily support quad channel U320= 1.28GB/s. Why bother building a PCI-Ex8 card when only a PCI-Ex4 card (which is a more standard physical format) would've been enough? Or if you are going to build a PCI-Ex8 card, why not support quad channel U320? This smells like there's a problem with LSI's design. The 128MB buffer also looks suspiciously small, and I do not see any upgrade path for it on LSI Logic's site. "Serious" RAID controllers from companies like Xyratex, Engino, and Dot-hill can have up to 1-2GB of buffer, and there's sound technical reasons for it. See if there's a buffer upgrade available or if you can get controllers that have larger buffer capabilities. Regardless of the above, each of these controllers should still be good for about 80-85% of 640MB/s, or ~510-540 MB/s apiece when doing raw sequential IO if you plug 3-4 fast enough HD's into each SCSI channel. Cheetah 15K.4's certainly are fast enough. Optimal setup is probably to split each RAID 1 pair so that one HD is on each of the SCSI channels, and then RAID 0 those pairs. That will also protect you from losing the entire disk subsystem if one of the SCSI channels dies. 
That 128MB of buffer cache may very well be too small to keep the IO rate up, and/or there may be a more subtle problem with the LSI card, and/or you may have a configuration problem, but _something(s)_ need fixing since you are only getting raw sequential IO of ~100-150MB/s when it should be above 500MB/s. This will make the most difference for initial reads (first time you load a table, first time you make a given query, etc) and for any writes. Your HW provider should be able to help you, even if some of the HW in question needs to be changed. You paid for a solution. As long as this stuff is performing at so much less than what it is supposed to, you have not received the solution you paid for. BTW, on the subject of RAID stripes, IME the sweet spot tends to be in the 64KB to 256KB range (very large, very read-heavy data mines can want larger RAID stripes). Only experimentation will tell you what results in the best performance for your application. >I'm not really worried about the writing, it's the reading >that needs to be faster. Initial reads are only going to be as fast as your HD subsystem, so there's a reason for making the HD subsystem faster even if all you care about is reads. In addition, I'll repeat my previous advice that upgrading to 16GB of RAM would be well worth it for you. Hope this helps, Ron Peacetree
Ron wrote: >> PERC4eDC-PCI Express, 128MB Cache, 2-External Channels > > Looks like they are using the LSI Logic MegaRAID SCSI 320-2E > controller. IIUC, you have 2 of these, each with 2 external channels? A lot of people have mentioned Dell's versions of the LSI cards can be WAY slower than the ones you buy from LSI. Why this is the case, nobody knows for sure. Here's a guess on my part. A while back, I was doing some googling. And instead of typing "LSI MegaRAID xxx", I just typed "MegaRAID xxx". Going beyond the initial pages, I saw Tekram -- a company that supposedly produces their own controllers -- listing products with the exact same model numbers and photos as cards from LSI and Areca. Seemed puzzling until I read a review about SATA RAID cards where it mentioned Tekram produces the Areca cards under their own name but using slower components to avoid competing at the high end with them. So what may be happening is that while the logic circuitry on the Dell PERCs is the same as on the source LSI cards, the RAID processor/RAM/internal buffers/etc. are not as fast, so Dell can shave off a few bucks on every server. That would mean that while a true LSI card has the processing power to do the RAID calculations for X drives, the Dell version probably can only do X*0.6 drives or so. > The 128MB buffer also looks suspiciously small, and I do not see any > upgrade path for it on LSI Logic's site. "Serious" RAID controllers > from companies like Xyratex, Engino, and Dot-hill can have up to 1-2GB The card is upgradable. If you look at the picture of the card, it shows an SDRAM DIMM rather than integrated RAM chips. I've also read reviews a while back comparing benchmarks of the 320-2 w/ 128MB versus 512MB onboard RAM. Their product literature is just nebulous on the RAM upgrade part. I'm sure if you opened up the PDF manuals, you could find the exact info. > That 128MB of buffer cache may very well be too small to keep the IO > rate up, and/or there may be a more subtle problem with the LSI card, > and/or you may have a configuration problem, but _something(s)_ need > fixing since you are only getting raw sequential IO of ~100-150MB/s when > it should be above 500MB/s. I think it just might be the Dell hardware or the lack of 64-bit IOMMU on Xeons. Here are my numbers for a 320-1 w/ 128MB paired up with Opterons, compared to Jeremiah's. >> # time dd if=/dev/zero of=testfile bs=1024 count=1000000 >> 1000000+0 records in >> 1000000+0 records out >> >> real 0m8.885s >> user 0m0.299s >> sys 0m6.998s 2x15K RAID1 real 0m14.493s user 0m0.255s sys 0m11.712s 6x15K RAID10 (2x 320-1) real 0m9.986s user 0m0.200s sys 0m8.634s >> # time dd of=/dev/null if=testfile bs=1024 count=1000000 >> 1000000+0 records in >> 1000000+0 records out >> >> real 0m1.654s >> user 0m0.232s >> sys 0m1.415s 2x15K RAID1 real 0m3.383s user 0m0.176s sys 0m3.207s 6x15K RAID10 (2x 320-1) real 0m2.427s user 0m0.178s sys 0m2.250s If all 14 HDs are arranged in a RAID10 array, I'd say there's definitely something wrong with Jeremiah's hardware.
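For easier comparison, those dd timings convert to throughput as bytes over seconds (1,000,000 blocks of 1024 bytes is roughly 1GB). A quick sketch with the numbers quoted above:

    awk 'BEGIN{printf "%.0f MB/s\n", 1000000*1024/1.654/1e6}'   # Jeremiah's cached read: ~619 MB/s
    awk 'BEGIN{printf "%.0f MB/s\n", 1000000*1024/2.427/1e6}'   # 6x15K RAID10 read:      ~422 MB/s
    awk 'BEGIN{printf "%.0f MB/s\n", 1000000*1024/8.885/1e6}'   # Jeremiah's write:       ~115 MB/s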
On Sun, 2005-08-21 at 16:13 -0400, Ron wrote: > At 10:54 AM 8/21/2005, Jeremiah Jahn wrote: > >On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote: > > > Ron wrote: > > > > > > Well, since you can get a read of the RAID at 150MB/s, that means that > > > it is actual I/O speed. It may not be cached in RAM. Perhaps you could > > > try the same test, only using say 1G, which should be cached. > > > >[root@io pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000 > >1000000+0 records in > >1000000+0 records out > > > >real 0m8.885s > >user 0m0.299s > >sys 0m6.998s > > This is abysmally slow. > > > >[root@io pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000 > >1000000+0 records in > >1000000+0 records out > > > >real 0m1.654s > >user 0m0.232s > >sys 0m1.415s > > This transfer rate is the only one out of the 4 you have posted that > is in the vicinity of where it should be. > > > >The raid array I have is currently set up to use a single channel. But I > >have dual controllers in the array. And dual external slots on the card. > >The machine is brand new and has pci-e backplane. > > > So you have 2 controllers each with 2 external slots? But you are > currently only using 1 controller and only one external slot on that > controller? Sorry, no. I have one dual channel card in the system and two controllers on the array. Dell PowerVault 220S w/ PERC4eDC-PCI Express > > > > > > Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them > > > > doing raw sequential IO like this should be capable of at > > > > ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's > >BTW I'm using Seagate Cheetah 15K.4's > > OK, now we have that nailed down. > > > > > > AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic > > > > MegaRAID controllers. What I don't know is which exact one yours is, > > > > nor do I know if it (or any of the MegaRAID controllers) are high > > > > powered enough. > > > >PERC4eDC-PCI Express, 128MB Cache, 2-External Channels > > Looks like they are using the LSI Logic MegaRAID SCSI 320-2E > controller. IIUC, you have 2 of these, each with 2 external channels? > > The specs on these appear a bit strange. They are listed as being a > PCI-Ex8 card, which means they should have a max bandwidth of 20Gb/s= > 2GB/s, yet they are also listed as only supporting dual channel U320= > 640MB/s when they could easily support quad channel U320= > 1.28GB/s. Why bother building a PCI-Ex8 card when only a PCI-Ex4 > card (which is a more standard physical format) would've been > enough? Or if you are going to build a PCI-Ex8 card, why not support > quad channel U320? This smells like there's a problem with LSI's design. > > The 128MB buffer also looks suspiciously small, and I do not see any > upgrade path for it on LSI Logic's site. "Serious" RAID controllers > from companies like Xyratex, Engino, and Dot-hill can have up to > 1-2GB of buffer, and there's sound technical reasons for it. See if > there's a buffer upgrade available or if you can get controllers that > have larger buffer capabilities. > > Regardless of the above, each of these controllers should still be > good for about 80-85% of 640MB/s, or ~510-540 MB/s apiece when doing > raw sequential IO if you plug 3-4 fast enough HD's into each SCSI > channel. Cheetah 15K.4's certainly are fast enough. Optimal setup > is probably to split each RAID 1 pair so that one HD is on each of > the SCSI channels, and then RAID 0 those pairs. 
That will also > protect you from losing the entire disk subsystem if one of the SCSI > channels dies. I like this idea, but how exactly does one bond the two channels together? Won't this cause me to have both an /dev/sdb and an /dev/sdc? > > That 128MB of buffer cache may very well be too small to keep the IO > rate up, and/or there may be a more subtle problem with the LSI card, > and/or you may have a configuration problem, but _something(s)_ need > fixing since you are only getting raw sequential IO of ~100-150MB/s > when it should be above 500MB/s. It looks like there's a way to add more memory to it. > > This will make the most difference for initial reads (first time you > load a table, first time you make a given query, etc) and for any writes. > > Your HW provider should be able to help you, even if some of the HW > in question needs to be changed. You paid for a solution. As long > as this stuff is performing at so much less then what it is supposed > to, you have not received the solution you paid for. > > BTW, on the subject of RAID stripes IME the sweet spot tends to be in > the 64KB to 256KB range (very large, very read heavy data mines can > want larger RAID stripes.). Only experimentation will tell you what > results in the best performance for your application. I think I have them very small at the moment. > > > >I'm not really worried about the writing, it's the reading the reading > >that needs to be faster. > > Initial reads are only going to be as fast as your HD subsystem, so > there's a reason for making the HD subsystem faster even if all you > care about is reads. In addition, I'll repeat my previous advice > that upgrading to 16GB of RAM would be well worth it for you. 12GB is my max. I may run with it for a while and see. > > Hope this helps, > Ron Peacetree > > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match -- Speak softly and carry a +6 two-handed sword.
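Since the concern is read speed and the stripes are suspected to be too small, one cheap setting worth checking alongside the stripe size is the block-device readahead. A sketch; /dev/sdb is again an assumed name for whatever the array appears as:

    blockdev --getra /dev/sdb        # current readahead, in 512-byte sectors
    blockdev --setra 4096 /dev/sdb   # try ~2MB of readahead for sequential-heavy reads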
Jeremiah Jahn wrote: > On Sun, 2005-08-21 at 16:13 -0400, Ron wrote: > >>At 10:54 AM 8/21/2005, Jeremiah Jahn wrote: >> ... >>So you have 2 controllers each with 2 external slots? But you are >>currently only using 1 controller and only one external slot on that >>controller? > > > Sorry, no. I have one dual channel card in the system and two > controllers on the array. Dell PowerVault 220S w/ PERC4eDC-PCI Express > > ... >>Regardless of the above, each of these controllers should still be >>good for about 80-85% of 640MB/s, or ~510-540 MB/s apiece when doing >>raw sequential IO if you plug 3-4 fast enough HD's into each SCSI >>channel. Cheetah 15K.4's certainly are fast enough. Optimal setup >>is probably to split each RAID 1 pair so that one HD is on each of >>the SCSI channels, and then RAID 0 those pairs. That will also >>protect you from losing the entire disk subsystem if one of the SCSI >>channels dies. > > I like this idea, but how exactly does one bond the two channels > together? Won't this cause me to have both an /dev/sdb and an /dev/sdc? > Well, even if you did, you could always either use software raid, or lvm to turn it into a single volume. It also depends what the controller card bios would let you get away with. Some cards would let you setup 4 RAID1's (one drive from each channel), and then create a RAID0 of those pairs. Software raid should do this without any problem. And can even be done such that it can be grown in the future, as well as work across multiple cards (though the latter is supported by some cards as well). > > >>That 128MB of buffer cache may very well be too small to keep the IO >>rate up, and/or there may be a more subtle problem with the LSI card, >>and/or you may have a configuration problem, but _something(s)_ need >>fixing since you are only getting raw sequential IO of ~100-150MB/s >>when it should be above 500MB/s. > > > It looks like there's a way to add more memory to it. This memory probably helps more in writing than reading. If you are reading the same area over and over, it might end up being a little bit of extra cache for that (but it should already be cached in system RAM, so you don't really get anything). ... >>Initial reads are only going to be as fast as your HD subsystem, so >>there's a reason for making the HD subsystem faster even if all you >>care about is reads. In addition, I'll repeat my previous advice >>that upgrading to 16GB of RAM would be well worth it for you. > > > 12GB is my max. I may run with it for a while and see. If your working set truly is 10GB, then you can get a massive performance increase even at 12GB. If your working set is 10GB and you have 6GB of RAM, it probably is always swapping out what it just read for the new stuff, even though you will read that same thing again in a few seconds. So rather than just paying for the 4GB that can't be cached, you pay for the whole 10. John =:-> > > >>Hope this helps, >>Ron Peacetree >> >> >> >>---------------------------(end of broadcast)--------------------------- >>TIP 9: In versions below 8.0, the planner will ignore your desire to >> choose an index scan if your joining column's datatypes do not >> match
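A sketch of the "RAID0 across hardware RAID1 pairs" layout John describes, assuming the controller is reconfigured to export the seven mirrored pairs as plain /dev/sdb through /dev/sdh (device names, chunk size, and mount point are all assumptions to experiment with):

    # stripe the seven hardware RAID1 pairs together with a 256KB chunk
    mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=7 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
    mkfs.ext3 /dev/md0
    mount /dev/md0 /var/lib/pgsql    # or wherever PGDATA lives

The other variant John mentions, one RAID1 per cross-channel disk pair built in software and then a RAID0 over the resulting md devices, is the same idea with the mirroring moved into mdadm as well.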