Thread: extremely low memory usage

From:
Jeremiah Jahn
Date:

I just put together a system with 6GB of RAM on a 14-disk RAID 10 array.
When I run my usual big painful queries, I get very little to no
memory usage. My production box (RAID 5, 4GB RAM) hovers at 3.9GB used
most of the time. The new devel box sits at around 250MB.

I've switched to an 8.0 system on the new devel box, but the .conf
really didn't change. Index usage is the same. Something seems wrong and
I'm not sure why.


any thoughts,
-jj-


shared_buffers = 32768          # min 16, at least max_connections*2, 8KB each
work_mem = 2097151              # min 64, size in KB
maintenance_work_mem = 819200   # min 1024, size in KB
max_fsm_pages = 80000           # min max_fsm_relations*16, 6 bytes each
checkpoint_segments = 30        # in logfile segments, min 1, 16MB each
effective_cache_size = 3600000   <-----this is a little out of control, but would it have any real effect?
random_page_cost = 2            # units are one sequential page fetch cost
log_min_duration_statement = 10000 # -1 is disabled, in milliseconds.
lc_messages = 'C'               # locale for system error message strings
lc_monetary = 'C'               # locale for monetary formatting
lc_numeric = 'C'                # locale for number formatting
lc_time = 'C'                   # locale for time formatting



--
"Now this is a totally brain damaged algorithm.  Gag me with a
smurfette."
                -- P. Buhr, Computer Science 354


From:
John A Meinel
Date:

Jeremiah Jahn wrote:
> I just put together a system with 6GB of RAM on a 14-disk RAID 10 array.
> When I run my usual big painful queries, I get very little to no
> memory usage. My production box (RAID 5, 4GB RAM) hovers at 3.9GB used
> most of the time. The new devel box sits at around 250MB.
>
> I've switched to an 8.0 system on the new devel box, but the .conf
> really didn't change. Index usage is the same. Something seems wrong and
> I'm not sure why.
>

How big is your actual database on disk? And how much of it is actually
touched by your queries?

It seems that your tough queries might only be exercising a portion of
the database. If you really want to make memory usage increase try
something like:
find . -type f -print0 | xargs -0 cat >/dev/null
Which should read all the files. After doing that, does the memory usage
increase?

>
> any thoughts,
> -jj-
>
>
> shared_buffers = 32768          # min 16, at least max_connections*2, 8KB each
> work_mem = 2097151              # min 64, size in KB

This seems awfully high. 2GB Per sort? This might actually be flushing
some of your ram, since it would get allocated and filled, and then
freed when finished. Remember, depending on what you are doing, this
amount can get allocated more than once per query.

> maintenance_work_mem = 819200   # min 1024, size in KB
> max_fsm_pages = 80000           # min max_fsm_relations*16, 6 bytes each
> checkpoint_segments = 30        # in logfile segments, min 1, 16MB each
> effective_cache_size = 3600000   <-----this is a little out of control, but would it have any real effect?

It should just tell the planner that it is more likely to have buffers
in cache, so index scans are slightly cheaper than they would otherwise be.

> random_page_cost = 2            # units are one sequential page fetch cost
> log_min_duration_statement = 10000 # -1 is disabled, in milliseconds.
> lc_messages = 'C'               # locale for system error message strings
> lc_monetary = 'C'               # locale for monetary formatting
> lc_numeric = 'C'                # locale for number formatting
> lc_time = 'C'                   # locale for time formatting
>

John
=:->

From:
Jeff Trout
Date:

On Aug 17, 2005, at 10:11 PM, Jeremiah Jahn wrote:

> I just put together a system with 6GB of RAM on a 14-disk RAID 10
> array.
> When I run my usual big painful queries, I get very little to no
> memory usage. My production box (RAID 5, 4GB RAM) hovers at 3.9GB used
> most of the time. The new devel box sits at around 250MB.
>

Is the system performing fine?  Are you touching as much data as the
production box?

If the system is performing fine don't worry about it.

> work_mem = 2097151              # min 64, size in KB

This is EXTREMELY high.  You realize this is the amount of memory
that can be used per-sort and per-hash build in a query? You can end
up with multiples of this on a single query.  If you have some big
queries that are run infrequently, have them set it manually.
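One way to follow that advice, instead of a 2GB global default, is a per-session override. A sketch, with a hypothetical reporting query and a 256MB value (work_mem is in KB in 8.0):

```sql
-- raise work_mem only while the big, infrequent query runs
SET work_mem = 262144;       -- 256MB for this session only
SELECT full_name, count(*)   -- hypothetical reporting query
  FROM identity
 GROUP BY full_name;
RESET work_mem;              -- back to the postgresql.conf value
```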

> effective_cache_size = 3600000   <-----this is a little out of
> control, but would it have any real effect?

This doesn't allocate anything - it is a hint to the planner about
how much data it can assume is cached.

--
Jeff Trout <>
http://www.jefftrout.com/
http://www.stuarthamm.net/



From:
Jeremiah Jahn
Date:

On Wed, 2005-08-17 at 21:21 -0500, John A Meinel wrote:
> Jeremiah Jahn wrote:
> > I just put together a system with 6GB of RAM on a 14-disk RAID 10 array.
> > When I run my usual big painful queries, I get very little to no
> > memory usage. My production box (RAID 5, 4GB RAM) hovers at 3.9GB used
> > most of the time. The new devel box sits at around 250MB.
> >
> > I've switched to an 8.0 system on the new devel box, but the .conf
> > really didn't change. Index usage is the same. Something seems wrong and
> > I'm not sure why.
> >
>
> How big is your actual database on disk? And how much of it is actually
> touched by your queries?
The DB is about 60GB. About 10GB is actually used in real queries,
versus "get me this single record with this ID" lookups. I have a large
query that finds court cases based on certain criteria that are name
based. I get a full seq scan on the name table in about 7 seconds; this
table has about 6 million names (most being 'smith, something'). The
index scan takes much less time of course, once it's been cached
(somewhere, but apparently not in memory). The real query can take 60
seconds on a first run, and 1.3 seconds on a second run. I'm very happy
with the cached results, just not really sure where that caching is
happening, since it doesn't show up as memory usage. I do know that the
caching that happens seems to be independent of the DB: I can restart
the DB and my speeds are still the same as for the cached second query.
Is there some way to pre-cache some of the tables/files on the file
system? If I switch my query to search for 'jones%' instead of
'smith%', I take a hit. But if I then rerun the smith search, I still
get cached speed. I only have two tables, essentially names and events,
that have to do any real work (i.e. not very atomic data). I'd love to
be able to force these two tables into a cache somewhere. This is a
Linux system (RHEL ES4), by the way.
>
> It seems that your tough queries might only be exercising a portion of
> the database. If you really want to make memory usage increase try
> something like:
> find . -type f -print0 | xargs -0 cat >/dev/null
> Which should read all the files. After doing that, does the memory usage
> increase?
>
> >
> > any thoughts,
> > -jj-
> >
> >
> > shared_buffers = 32768          # min 16, at least max_connections*2, 8KB each
> > work_mem = 2097151              # min 64, size in KB
>
> This seems awfully high. 2GB Per sort? This might actually be flushing
> some of your ram, since it would get allocated and filled, and then
> freed when finished. Remember, depending on what you are doing, this
> amount can get allocated more than once per query.
What's a good way to determine the optimal size?

>
> > maintenance_work_mem = 819200   # min 1024, size in KB
> > max_fsm_pages = 80000           # min max_fsm_relations*16, 6 bytes each
> > checkpoint_segments = 30        # in logfile segments, min 1, 16MB each
> > effective_cache_size = 3600000   <-----this is a little out of control, but would it have any real effect?
>
> It should just tell the planner that it is more likely to have buffers
> in cache, so index scans are slightly cheaper than they would otherwise be.
>
> > random_page_cost = 2            # units are one sequential page fetch cost
> > log_min_duration_statement = 10000 # -1 is disabled, in milliseconds.
> > lc_messages = 'C'               # locale for system error message strings
> > lc_monetary = 'C'               # locale for monetary formatting
> > lc_numeric = 'C'                # locale for number formatting
> > lc_time = 'C'                   # locale for time formatting
> >
>
> John
> =:->
--
"Now this is a totally brain damaged algorithm.  Gag me with a
smurfette."
                -- P. Buhr, Computer Science 354


From:
John Arbash Meinel
Date:

Jeremiah Jahn wrote:

>On Wed, 2005-08-17 at 21:21 -0500, John A Meinel wrote:
>
>
>>Jeremiah Jahn wrote:
>>
>>
>>>I just put together a system with 6GB of RAM on a 14-disk RAID 10 array.
>>>When I run my usual big painful queries, I get very little to no
>>>memory usage. My production box (RAID 5, 4GB RAM) hovers at 3.9GB used
>>>most of the time. The new devel box sits at around 250MB.
>>>
>>>I've switched to an 8.0 system on the new devel box, but the .conf
>>>really didn't change. Index usage is the same. Something seems wrong and
>>>I'm not sure why.
>>>
>>>
>>>
>>How big is your actual database on disk? And how much of it is actually
>>touched by your queries?
>>
>>
>The DB is about 60GB. About 10GB is actually used in real queries,
>versus "get me this single record with this ID" lookups. I have a large
>query that finds court cases based on certain criteria that are name
>based. I get a full seq scan on the name table in about 7 seconds; this
>table has about 6 million names (most being 'smith, something'). The
>index scan takes much less time of course, once it's been cached
>(somewhere, but apparently not in memory). The real query can take 60
>seconds on a first run, and 1.3 seconds on a second run. I'm very happy
>with the cached results, just not really sure where that caching is
>happening, since it doesn't show up as memory usage. I do know that the
>caching that happens seems to be independent of the DB: I can restart
>the DB and my speeds are still the same as for the cached second query.
>Is there some way to pre-cache some of the tables/files on the file
>system? If I switch my query to search for 'jones%' instead of
>'smith%', I take a hit. But if I then rerun the smith search, I still
>get cached speed. I only have two tables, essentially names and events,
>that have to do any real work (i.e. not very atomic data). I'd love to
>be able to force these two tables into a cache somewhere. This is a
>Linux system (RHEL ES4), by the way.
>
>
I think what is happening is that *some* of the index pages are being
cached, just not all of them. Most indexes (if you didn't specify
anything special) are btree, so that you load the root page, and then
determine what pages need to be loaded from there. So the "jones%" pages
aren't anywhere near the "smith%" pages, and don't need to be loaded if
you aren't accessing them.

So the required memory usage might be smaller than you think. At least
until all of the index pages have been accessed.

The reason it is DB independent is that the OS is caching file
access (you read a file, it keeps the old pages in RAM in case you ask
for them again).

Part of the trick, is that as you use the database, it will cache what
has been used. So you may not need to do anything. It should sort itself
out with time.
However, if you need cached performance as soon as your machine
reboots, you could figure out which files on disk represent your indexes
and tables, and then just "cat $files >/dev/null".
That should cause a read of those files, which should pull them into the
memory cache. *However*, this will fail if the size of those files is
greater than available memory, so you may want to be a little stingy
about what you preload.
Alternatively, you could just write an SQL script which runs a bunch of
indexed queries to make sure all the pages get loaded.

Something like (inside a plpgsql function; a bare SELECT isn't allowed
there, so use PERFORM):
FOR curname IN SELECT DISTINCT name FROM users LOOP
  PERFORM name FROM users WHERE name = curname;
END LOOP;

That should make the database go through the entire table, and load the
index for every user. This is overkill, and will probably take a long
time to execute. But you could do it if you wanted.
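The file-level variant of that preload can be sketched as a short script; the data directory path is an assumption (RHEL packages commonly use /var/lib/pgsql/data), not something stated in the thread:

```shell
# Read every file under the chosen directory once; the kernel keeps the
# pages in its cache, so later Postgres reads hit RAM instead of disk.
DATADIR="${DATADIR:-/var/lib/pgsql/data/base}"
find "$DATADIR" -type f -print0 2>/dev/null | xargs -0 cat > /dev/null 2>&1
echo "preload pass over $DATADIR finished"
```

As John notes above, only do this if the files fit comfortably in RAM.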

>>It seems that your tough queries might only be exercising a portion of
>>the database. If you really want to make memory usage increase try
>>something like:
>>find . -type f -print0 | xargs -0 cat >/dev/null
>>Which should read all the files. After doing that, does the memory usage
>>increase?
>>
>>
>>
>>>any thoughts,
>>>-jj-
>>>
>>>
>>>shared_buffers = 32768          # min 16, at least max_connections*2, 8KB each
>>>work_mem = 2097151              # min 64, size in KB
>>>
>>>
>>This seems awfully high. 2GB Per sort? This might actually be flushing
>>some of your ram, since it would get allocated and filled, and then
>>freed when finished. Remember, depending on what you are doing, this
>>amount can get allocated more than once per query.
>>
>>
>What's a good way to determine the optimal size?
>
>

Practice. :) A few questions I guess...
How many concurrent connections are you expecting? How many joins does a
standard query have? How big are the joins?

In general, I would tend to make this a smaller number, so that the OS
has more room to cache tables, rather than having big buffers for joins.
If someone is requesting a join that requires a lot of rows, I would
rather *that* query be slower than have it impact everyone else.
I would put the maximum more in the 20-100MB range.
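As a postgresql.conf line, a value in that range might look like this; the exact number is illustrative, not something settled in the thread (8.0's work_mem is in KB):

```ini
work_mem = 32768        # 32MB per sort/hash build, within the 20-100MB range
```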

John
=:->

>
>
>>>maintenance_work_mem = 819200   # min 1024, size in KB
>>>max_fsm_pages = 80000           # min max_fsm_relations*16, 6 bytes each
>>>checkpoint_segments = 30        # in logfile segments, min 1, 16MB each
>>>effective_cache_size = 3600000   <-----this is a little out of control, but would it have any real effect?
>>>
>>>
>>It should just tell the planner that it is more likely to have buffers
>>in cache, so index scans are slightly cheaper than they would otherwise be.
>>
>>
>>
>>>random_page_cost = 2            # units are one sequential page fetch cost
>>>log_min_duration_statement = 10000 # -1 is disabled, in milliseconds.
>>>lc_messages = 'C'               # locale for system error message strings
>>>lc_monetary = 'C'               # locale for monetary formatting
>>>lc_numeric = 'C'                # locale for number formatting
>>>lc_time = 'C'                   # locale for time formatting
>>>
>>>
>>>
>>John
>>=:->
>>
>>


From:
Jeremiah Jahn
Date:

here's an example standard query. I really have to make the first hit go
faster. The table is clustered on full_name as well. 'Smith%'
took 87 seconds on the first hit. I wonder if I set up my array wrong.
I remember seeing something about DMA access versus something else, and
chose DMA access. LVM maybe?

explain analyze select distinct
    case_category, identity_id, court.name, litigant_details.case_id, case_year,
    date_of_birth, assigned_case_role, litigant_details.court_ori, full_name,
    litigant_details.actor_id, case_data.type_code, case_data.subtype_code,
    litigant_details.impound_litigant_data,
    to_number(trim(leading case_data.type_code
                   from trim(leading case_data.case_year from case_data.case_id)), '999999') as seq
from identity, court, litigant_details, case_data
where identity.court_ori = litigant_details.court_ori
  and identity.case_id = litigant_details.case_id
  and identity.actor_id = litigant_details.actor_id
  and court.id = identity.court_ori
  and identity.court_ori = case_data.court_ori
  and case_data.case_id = identity.case_id
  and identity.court_ori = 'IL081025J'
  and full_name like 'MILLER%'
order by full_name;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Unique  (cost=20411.84..20411.91 rows=2 width=173) (actual time=38340.231..38355.120 rows=4906 loops=1)
   ->  Sort  (cost=20411.84..20411.84 rows=2 width=173) (actual time=38340.227..38343.667 rows=4906 loops=1)
         Sort Key: identity.full_name, case_data.case_category, identity.identity_id, court.name, litigant_details.case_id, case_data.case_year, identity.date_of_birth, litigant_details.assigned_case_role, litigant_details.court_ori, litigant_details.actor_id, case_data.type_code, case_data.subtype_code, litigant_details.impound_litigant_data, to_number(ltrim(ltrim((case_data.case_id)::text, (case_data.case_year)::text), (case_data.type_code)::text), '999999'::text)
         ->  Nested Loop  (cost=0.00..20411.83 rows=2 width=173) (actual time=12.891..38317.017 rows=4906 loops=1)
               ->  Nested Loop  (cost=0.00..20406.48 rows=1 width=159) (actual time=12.826..23232.106 rows=4906 loops=1)
                     ->  Nested Loop  (cost=0.00..20403.18 rows=1 width=138) (actual time=12.751..22885.439 rows=4906 loops=1)
                           Join Filter: (("outer".case_id)::text = ("inner".case_id)::text)
                           ->  Index Scan using name_speed on identity  (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.142..52.538 rows=4915 loops=1)
                                 Index Cond: (((full_name)::text >= 'MILLER'::character varying) AND ((full_name)::text < 'MILLES'::character varying))
                                 Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text))
                           ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..3.96 rows=1 width=81) (actual time=4.631..4.635 rows=1 loops=4915)
                                 Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text)
                                 Filter: ('IL081025J'::text = (court_ori)::text)
                     ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=33) (actual time=0.053..0.062 rows=1 loops=4906)
                           Filter: ('IL081025J'::text = (id)::text)
               ->  Index Scan using case_speed on case_data  (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 rows=1 loops=4906)
                     Index Cond: (('IL081025J'::text = (case_data.court_ori)::text) AND ((case_data.case_id)::text = ("outer".case_id)::text))
 Total runtime: 38359.722 ms
(18 rows)

copa=> explain analyze select distinct
    case_category, identity_id, court.name, litigant_details.case_id, case_year,
    date_of_birth, assigned_case_role, litigant_details.court_ori, full_name,
    litigant_details.actor_id, case_data.type_code, case_data.subtype_code,
    litigant_details.impound_litigant_data,
    to_number(trim(leading case_data.type_code
                   from trim(leading case_data.case_year from case_data.case_id)), '999999') as seq
from identity, court, litigant_details, case_data
where identity.court_ori = litigant_details.court_ori
  and identity.case_id = litigant_details.case_id
  and identity.actor_id = litigant_details.actor_id
  and court.id = identity.court_ori
  and identity.court_ori = case_data.court_ori
  and case_data.case_id = identity.case_id
  and identity.court_ori = 'IL081025J'
  and full_name like 'MILLER%'
order by full_name;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Unique  (cost=20411.84..20411.91 rows=2 width=173) (actual time=666.832..688.081 rows=4906 loops=1)
   ->  Sort  (cost=20411.84..20411.84 rows=2 width=173) (actual time=666.825..671.833 rows=4906 loops=1)
         Sort Key: identity.full_name, case_data.case_category, identity.identity_id, court.name, litigant_details.case_id, case_data.case_year, identity.date_of_birth, litigant_details.assigned_case_role, litigant_details.court_ori, litigant_details.actor_id, case_data.type_code, case_data.subtype_code, litigant_details.impound_litigant_data, to_number(ltrim(ltrim((case_data.case_id)::text, (case_data.case_year)::text), (case_data.type_code)::text), '999999'::text)
         ->  Nested Loop  (cost=0.00..20411.83 rows=2 width=173) (actual time=0.216..641.366 rows=4906 loops=1)
               ->  Nested Loop  (cost=0.00..20406.48 rows=1 width=159) (actual time=0.149..477.063 rows=4906 loops=1)
                     ->  Nested Loop  (cost=0.00..20403.18 rows=1 width=138) (actual time=0.084..161.045 rows=4906 loops=1)
                           Join Filter: (("outer".case_id)::text = ("inner".case_id)::text)
                           ->  Index Scan using name_speed on identity  (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.047..37.898 rows=4915 loops=1)
                                 Index Cond: (((full_name)::text >= 'MILLER'::character varying) AND ((full_name)::text < 'MILLES'::character varying))
                                 Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text))
                           ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..3.96 rows=1 width=81) (actual time=0.015..0.017 rows=1 loops=4915)
                                 Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text)
                                 Filter: ('IL081025J'::text = (court_ori)::text)
                     ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=33) (actual time=0.049..0.056 rows=1 loops=4906)
                           Filter: ('IL081025J'::text = (id)::text)
               ->  Index Scan using case_speed on case_data  (cost=0.00..5.29 rows=3 width=53) (actual time=0.017..0.020 rows=1 loops=4906)
                     Index Cond: (('IL081025J'::text = (case_data.court_ori)::text) AND ((case_data.case_id)::text = ("outer".case_id)::text))
 Total runtime: 694.639 ms
(18 rows)



On Thu, 2005-08-18 at 09:00 -0400, Jeff Trout wrote:
> On Aug 17, 2005, at 10:11 PM, Jeremiah Jahn wrote:
>
> > I just put together a system with 6GB of RAM on a 14-disk RAID 10
> > array.
> > When I run my usual big painful queries, I get very little to no
> > memory usage. My production box (RAID 5, 4GB RAM) hovers at 3.9GB used
> > most of the time. The new devel box sits at around 250MB.
> >
>
> Is the system performing fine?  Are you touching as much data as the
> production box?
>
> If the system is performing fine don't worry about it.
>
> > work_mem = 2097151              # min 64, size in KB
>
> This is EXTREMELY high.  You realize this is the amount of memory
> that can be used per-sort and per-hash build in a query? You can end
> up with multiples of this on a single query.  If you have some big
> queries that are run infrequently, have them set it manually.
>
> > effective_cache_size = 3600000   <-----this is a little out of
> > control, but would it have any real effect?
>
> This doesn't allocate anything - it is a hint to the planner about
> how much data it can assume is cached.
>
> --
> Jeff Trout <>
> http://www.jefftrout.com/
> http://www.stuarthamm.net/
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
--
"Now this is a totally brain damaged algorithm.  Gag me with a
smurfette."
                -- P. Buhr, Computer Science 354


From:
John Arbash Meinel
Date:

Jeremiah Jahn wrote:

>here's an example standard query. I really have to make the first hit go
>faster. The table is clustered on full_name as well. 'Smith%'
>took 87 seconds on the first hit. I wonder if I set up my array wrong.
>I remember seeing something about DMA access versus something else, and
>chose DMA access. LVM maybe?
>
>
It would be nice if you would format your queries to be a little bit
easier to read before posting them.
However, I believe I am reading it correctly, to say that the index scan
on identity is not your slow point. In fact, as near as I can tell, it
only takes 52ms to complete.

The expensive parts are the 4915 lookups into litigant_details (each
one takes approx 4ms, for a total of ~20s).
And then you do it again on case_data (average 3ms each * 4906 loops =
~15s).

So there is no need for preloading your indexes on the identity table.
It is definitely not the bottleneck.

So a few design bits, which may help your database.
Why is "actor_id" a text field instead of a number?
You could try creating an index on "litigant_details (actor_id,
court_ori)" so that it can do just an index lookup, rather than an index
+ filter.
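Creating such a two-column index would look something like this (the index name is made up; court_ori is the column the plan filters on):

```sql
-- lets the planner satisfy both the join condition and the court filter
-- from the index alone, with no separate Filter step
CREATE INDEX lit_actor_court_speed ON litigant_details (actor_id, court_ori);
ANALYZE litigant_details;
```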

More importantly, though, the planner seems to think the join of
identity to litigant_details will only return 1 row, not 5000.
Do you regularly vacuum analyze your tables?
Just as a test, try running:
set enable_nestloop to off;
And then run EXPLAIN ANALYZE again, just to see if it is faster.

You probably need to increase some statistics targets, so that the
planner can design better plans.
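Raising statistics targets is done per column; a sketch (the column choices and the value 100 are illustrative, and 8.0's default target is 10):

```sql
ALTER TABLE identity ALTER COLUMN full_name SET STATISTICS 100;
ALTER TABLE litigant_details ALTER COLUMN actor_id SET STATISTICS 100;
ANALYZE identity;           -- re-collect stats so the planner sees them
ANALYZE litigant_details;
```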

>          ->  Nested Loop  (cost=0.00..20411.83 rows=2 width=173)
> (actual time=12.891..38317.017 rows=4906 loops=1)
>                ->  Nested Loop  (cost=0.00..20406.48 rows=1 width=159)
> (actual time=12.826..23232.106 rows=4906 loops=1)
>                      ->  Nested Loop  (cost=0.00..20403.18 rows=1
> width=138) (actual time=12.751..22885.439 rows=4906 loops=1)
>                            Join Filter: (("outer".case_id)::text =
> ("inner".case_id)::text)
>                            ->  Index Scan using name_speed on
> identity  (cost=0.00..1042.34 rows=4868 width=82) (actual
> time=0.142..52.538 rows=4915 loops=1)
>                                  Index Cond: (((full_name)::text >=
> 'MILLER'::character varying) AND ((full_name)::text <
> 'MILLES'::character varying))
>                                  Filter: (((court_ori)::text =
> 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text))
>                            ->  Index Scan using lit_actor_speed on
> litigant_details  (cost=0.00..3.96 rows=1 width=81) (actual
> time=4.631..4.635 rows=1 loops=4915)
>                                  Index Cond: (("outer".actor_id)::text
> = (litigant_details.actor_id)::text)
>                                  Filter: ('IL081025J'::text =
> (court_ori)::text)
>                      ->  Seq Scan on court  (cost=0.00..3.29 rows=1
> width=33) (actual time=0.053..0.062 rows=1 loops=4906)
>                            Filter: ('IL081025J'::text = (id)::text)
>                ->  Index Scan using case_speed on case_data
> (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 rows=1
> loops=4906)
>                      Index Cond: (('IL081025J'::text =
> (case_data.court_ori)::text) AND ((case_data.case_id)::text =
> ("outer".case_id)::text))


John
=:->


From:
Ron
Date:

At 01:55 PM 8/18/2005, John Arbash Meinel wrote:
>Jeremiah Jahn wrote:
>
 > >here's an example standard query. I really have to make the first hit go
 > >faster. The table is clustered on full_name as well. 'Smith%'
 > >took 87 seconds on the first hit. I wonder if I set up my array wrong.
 > >I remember seeing something about DMA access versus something else, and
 > >chose DMA access. LVM maybe?
> >
> >
>It would be nice if you would format your queries to be a little bit
>easier to read before posting them.
>However, I believe I am reading it correctly, to say that the index scan
>on identity is not your slow point. In fact, as near as I can tell, it
>only takes 52ms to complete.
>
>The expensive parts are the 4915 lookups into the litigant_details (each
>one takes approx 4ms for a total of ~20s).
>And then you do it again on case_data (average 3ms each * 4906 loops =
>~15s).

How big are litigant_details and case_data?  If they can fit in RAM,
preload them using methods like the "cat to /dev/null" trick and
those table lookups will be ~100-1000x faster.  If they won't fit
into RAM but the machine can be expanded to hold enough RAM to fit
the tables, it's well worth the ~$75-$150/GB to upgrade the server so
that the tables will fit into RAM.
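A quick way to check whether they can fit (a sketch; 8.0 has no pg_relation_size(), so this uses the planner's relpages figure, which is only as fresh as the last VACUUM/ANALYZE):

```sql
-- approximate on-disk size in MB: relpages are 8KB each
SELECT relname, relpages * 8 / 1024 AS approx_mb
  FROM pg_class
 WHERE relname IN ('litigant_details', 'case_data');
```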

If they can't be made to fit into RAM as atomic entities, you have a
few choices:
A= Put the data tables and indexes on separate dedicated spindles and
put litigant_details and case_data each on their own dedicated
spindles.  This will lower seek conflicts.  Again, if this requires
buying some more HDs, it's well worth it.

B= Break litigant_details and case_data into a set of smaller tables
(based on something sane like the first n characters of the primary key)
such that the smaller tables easily fit into RAM.  Given that you've
said only 10GB/60GB is "hot", this could work very well.  Combine it
with "A" above (put all the litigant_details sub tables on one
dedicated spindle set and all the case_data sub tables on another
spindle set) for added oomph.

C= Buy a SSD big enough to hold litigant_details and case_data and
put them there.  Again, this can be combined with "A" and "B" above
to lessen the size of the SSD needed.


>So there is no need for preloading your indexes on the identity
>table.  It is definitely not the bottleneck.
>
>So a few design bits, which may help your database.  Why is
>"actor_id" a text field instead of a number?
>You could try creating an index on "litigant_details (actor_id,
>court_ori)" so that it can do just an index lookup, rather than an
>index + filter.

Yes, that certainly sounds like it would be more efficient.


>More importantly, though, the planner seems to think the join of
>identity to litigant_details will only return 1 row, not 5000.
>Do you regularly vacuum analyze your tables?
>Just as a test, try running:
>set enable_nestloop to off;
>And then run EXPLAIN ANALYZE again, just to see if it is faster.
>
>You probably need to increase some statistics targets, so that the
>planner can design better plans.
>
> > ->  Nested Loop  (cost=0.00..20411.83 rows=2 width=173) (actual time=12.891..38317.017 rows=4906 loops=1)
> >       ->  Nested Loop  (cost=0.00..20406.48 rows=1 width=159) (actual time=12.826..23232.106 rows=4906 loops=1)
> >             ->  Nested Loop  (cost=0.00..20403.18 rows=1 width=138) (actual time=12.751..22885.439 rows=4906 loops=1)
> >                   Join Filter: (("outer".case_id)::text = ("inner".case_id)::text)
> >                   ->  Index Scan using name_speed on identity  (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.142..52.538 rows=4915 loops=1)
> >                         Index Cond: (((full_name)::text >= 'MILLER'::character varying) AND ((full_name)::text < 'MILLES'::character varying))
> >                         Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text))
> >                   ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..3.96 rows=1 width=81) (actual time=4.631..4.635 rows=1 loops=4915)
> >                         Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text)
> >                         Filter: ('IL081025J'::text = (court_ori)::text)
> >             ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=33) (actual time=0.053..0.062 rows=1 loops=4906)
> >                   Filter: ('IL081025J'::text = (id)::text)
> >       ->  Index Scan using case_speed on case_data  (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 rows=1 loops=4906)
> >             Index Cond: (('IL081025J'::text = (case_data.court_ori)::text) AND ((case_data.case_id)::text = ("outer".case_id)::text))




From:
Jeremiah Jahn
Date:

Sorry about the formatting.

On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
> Jeremiah Jahn wrote:
>
> >here's an example standard query. I really have to make the first hit
> >go faster. The table is clustered on full_name as well. 'Smith%' took
> >87 seconds on the first hit. I wonder if I set up my array wrong. I
> >remember seeing something about DMA access versus something else, and
> >chose DMA access. LVM maybe?
> >
> >
> It would be nice if you would format your queries to be a little bit
> easier to read before posting them.
> However, I believe I am reading it correctly, to say that the index scan
> on identity is not your slow point. In fact, as near as I can tell, it
> only takes 52ms to complete.
>
> The expensive parts are the 4915 lookups into the litigant_details (each
> one takes approx 4ms for a total of ~20s).
> And then you do it again on case_data (average 3ms each * 4906 loops =
> ~15s).
Is there some way to avoid this?


>
> So there is no need for preloading your indexes on the identity table.
> It is definitely not the bottleneck.
>
> So a few design bits, which may help your database.
> Why is "actor_id" a text field instead of a number?
This is simply due to the nature of the data.

> You could try creating an index on "litigant_details (actor_id,
> court_ori)" so that it can do just an index lookup, rather than an index
> + filter.
I have one, but it doesn't seem to want to use it. I don't really need
it, though; I can just drop the court_id out of the query. It's
redundant, since each actor_id is also unique in litigant_details. I had
run VACUUM FULL and ANALYZE, but I ran them again anyway and the
planning improved. However, my 14-disk RAID 10 array is still slower
than the 3-disk RAID 5 on my production box: 46 sec vs. 30 sec (with
live traffic on production). One of the strange things is that when I
run the cat command on my "hot" indexes and tables, it has no effect on
memory usage. Right now I'm running ext3 on LVM. I'm still in a position
to redo the file system and everything. Is this a good way to do it, or
should I switch to something else? What about stripe and extent sizes?
Kernel parameters to change?
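For reference, a sketch of the covering index suggested above (assuming
court_ori is the intended second column; the index name here is made up):

```sql
-- Hypothetical covering index so the court_ori check can be satisfied
-- by the index scan itself instead of a row-level Filter:
CREATE INDEX lit_actor_court_speed
    ON litigant_details (actor_id, court_ori);
ANALYZE litigant_details;  -- refresh stats so the planner considers it
```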



---------------devel box:-----------------------

copa=# EXPLAIN ANALYZE select
full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data
copa-# from identity
copa-# join litigant_details on identity.actor_id = litigant_details.actor_id
copa-# join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori =
case_data.court_ori
copa-# join court on identity.court_ori = court.id
copa-# where identity.court_ori = 'IL081025J' and full_name like 'JONES%' order by  full_name;
                                                                       QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=3.29..29482.22 rows=3930 width=86) (actual time=114.060..46001.480 rows=5052 loops=1)
   ->  Nested Loop  (cost=3.29..16193.27 rows=3820 width=112) (actual time=93.038..24584.275 rows=5052 loops=1)
         ->  Nested Loop  (cost=0.00..16113.58 rows=3820 width=113) (actual time=85.778..24536.489 rows=5052 loops=1)
               ->  Index Scan using name_speed on identity  (cost=0.00..824.72 rows=3849 width=82) (actual
time=50.284..150.133 rows=5057 loops=1)
                     Index Cond: (((full_name)::text >= 'JONES'::character varying) AND ((full_name)::text <
'JONET'::character varying))
                     Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'JONES%'::text))
               ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..3.96 rows=1 width=81) (actual
time=4.788..4.812 rows=1 loops=5057)
                     Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text)
         ->  Materialize  (cost=3.29..3.30 rows=1 width=12) (actual time=0.002..0.003 rows=1 loops=5052)
               ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=12) (actual time=7.248..7.257 rows=1 loops=1)
                     Filter: ('IL081025J'::text = (id)::text)
   ->  Index Scan using case_speed on case_data  (cost=0.00..3.46 rows=1 width=26) (actual time=4.222..4.230 rows=1
loops=5052)
         Index Cond: ((("outer".court_ori)::text = (case_data.court_ori)::text) AND (("outer".case_id)::text =
(case_data.case_id)::text))
 Total runtime: 46005.994 ms



>
> More importantly, though, the planner seems to think the join of
> identity to litigant_details will only return 1 row, not 5000.
> Do you regularly vacuum analyze your tables?
> Just as a test, try running:
> set enable_nestloop to off;
not quite acceptable
Total runtime: 221486.149 ms

> And then run EXPLAIN ANALYZE again, just to see if it is faster.
>
> You probably need to increase some statistics targets, so that the
> planner can design better plans.
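Concretely, raising the statistics targets might look something like this
(the target value 100 and the choice of columns are only illustrative;
the default per-column target in 7.4/8.0 is 10):

```sql
-- Raise the per-column statistics sample for the hot join columns,
-- then re-ANALYZE so the planner sees better row estimates:
ALTER TABLE identity         ALTER COLUMN actor_id SET STATISTICS 100;
ALTER TABLE litigant_details ALTER COLUMN actor_id SET STATISTICS 100;
ANALYZE identity;
ANALYZE litigant_details;
```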

---------------------this is the output from the production box------------------
LOG:  duration: 27213.068 ms  statement: EXPLAIN ANALYZE select
full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data
        from identity
        join litigant_details on identity.actor_id = litigant_details.actor_id
        join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori =
case_data.court_ori
        join court on identity.court_ori = court.id
        where identity.court_ori = 'IL081025J' and full_name like 'JONES%' order by  full_name;
                                                                       QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=3.29..43498.76 rows=2648 width=86) (actual time=17.106..27192.000 rows=5052 loops=1)
   ->  Nested Loop  (cost=0.00..43442.53 rows=2647 width=87) (actual time=16.947..27120.619 rows=5052 loops=1)
         ->  Nested Loop  (cost=0.00..23061.79 rows=3827 width=113) (actual time=16.801..17390.682 rows=5052 loops=1)
               ->  Index Scan using name_speed on identity  (cost=0.00..1277.39 rows=3858 width=82) (actual
time=9.842..213.424 rows=5057 loops=1)
                     Index Cond: (((full_name)::text >= 'JONES'::character varying) AND ((full_name)::text <
'JONET'::character varying))
                     Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'JONES%'::text))
               ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..5.63 rows=1 width=81) (actual
time=3.355..3.364 rows=1 loops=5057)
                     Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text)
         ->  Index Scan using case_data_pkey on case_data  (cost=0.00..5.31 rows=1 width=26) (actual time=1.897..1.904
rows=1 loops=5052)
               Index Cond: ((("outer".court_ori)::text = (case_data.court_ori)::text) AND (("outer".case_id)::text =
(case_data.case_id)::text))
   ->  Materialize  (cost=3.29..3.30 rows=1 width=12) (actual time=0.002..0.003 rows=1 loops=5052)
         ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=12) (actual time=0.142..0.165 rows=1 loops=1)
               Filter: ('IL081025J'::text = (id)::text)
 Total runtime: 27205.060 ms

>
>
> John
> =:->
>
--
"I didn't know it was impossible when I did it."


From:
John A Meinel
Date:

Jeremiah Jahn wrote:
> Sorry about the formatting.
>
> On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
>
>>Jeremiah Jahn wrote:
>>
>>

...

>>The expensive parts are the 4915 lookups into the litigant_details (each
>>one takes approx 4ms for a total of ~20s).
>>And then you do it again on case_data (average 3ms each * 4906 loops =
>>~15s).
>
> Is there some way to avoid this?
>

Well, in general, 3ms for a single lookup seems really long. Maybe your
index is bloated by not vacuuming often enough. Do you tend to get a lot
of updates to litigant_details?

There are a couple possibilities at this point. First, you can REINDEX
the appropriate index, and see if that helps. However, if this is a test
box, it sounds like you just did a dump and reload, which wouldn't have
bloat in an index.
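If it were index bloat, the rebuild would be something like this
(lit_actor_speed taken from the plans above):

```sql
-- Rebuild the index from scratch; this takes an exclusive lock while
-- it runs, so schedule it for a quiet period:
REINDEX INDEX lit_actor_speed;
```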

Another possibility. Is this the column that you usually use when
pulling information out of litigant_details? If so, you can CLUSTER
litigant_details on the appropriate index. This will help things be
close together that should be, which decreases the index lookup costs.

However, if this is not the common column, then you probably will slow
down whatever other accesses you may have on this table.

After CLUSTER, the current data will stay clustered, but new data will
not, so you have to continually CLUSTER, the same way that you might
VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
expensive as a VACUUM FULL. Be aware of this, but it might vastly
improve your performance, so it would be worth it.
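On 7.4/8.0 the command form is CLUSTER index ON table, e.g. (a sketch
using the index from the plans above):

```sql
-- Physically reorder litigant_details by the lit_actor_speed index;
-- grabs an exclusive lock for the duration, much like VACUUM FULL:
CLUSTER lit_actor_speed ON litigant_details;
ANALYZE litigant_details;  -- re-collect stats on the reordered table
```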

>
>
>>So there is no need for preloading your indexes on the identity table.
>>It is definitely not the bottleneck.
>>
>>So a few design bits, which may help your database.
>>Why is "actor_id" a text field instead of a number?
>
> This is simply due to the nature of the data.
>

I'm just wondering if changing it to a number, and using a number->name
lookup, would be faster for you. It may not be. In general, I prefer to
use numbers for references. I may be over paranoid, but I know that some
locales are bad with string -> string comparisons. And since the data in
your database is stored as UNICODE, I'm not sure if it has to do any
translating or not. Again, something to consider, it may not make any
difference.
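A hypothetical version of that rework (the table and column names below
are invented for illustration):

```sql
-- Keep the text code in one small lookup table and join everywhere
-- else on a compact integer surrogate key:
CREATE TABLE actor_map (
    actor_num serial PRIMARY KEY,
    actor_id  text NOT NULL UNIQUE   -- the existing text identifier
);
-- identity and litigant_details would then carry actor_num instead of
-- actor_id, so the hot joins compare integers rather than text.
```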


>
>>You could try creating an index on "litigant_details (actor_id,
>>court_ori)" so that it can do just an index lookup, rather than an index
>>+ filter.
>
> I have one, but it doesn't seem to like to use it. Don't really need it
> though, I can just drop the court_id out of the query. It's redundant,
> since each actor_id is also unique in litigant details. I had run vac
> full and analyze but I ran them again anyway and the planning improved.
> However, my 14 disk raid 10 array is still slower than my 3 disk raid 5
> on my production box. 46sec vs 30sec (with live traffic on the
> production) One of the strange things is that when I run the cat command
> on my index and tables that are "HOT" it has no effect on memory usage.
> Right now I'm running ext3 on LVM. I'm still in a position to redo the
> file system and everything. Is this a good way to do it or should I
> switch to something else? What about stripe and extent sizes...? kernel
> parameters to change?

Well, the plans are virtually identical. There is one small difference
as to whether it joins against case_data or court first. But 'court' is
very tiny (small enough to use a seqscan instead of an index scan). I'm
a little surprised, with court being this small, that it doesn't do
something like a hash join, but court takes no time anyway.

The real problem is that your nested loop index time is *much* slower.

Devel:
->  Index Scan using lit_actor_speed on litigant_details
    (cost=0.00..3.96 rows=1 width=81)
    (actual time=4.788..4.812 rows=1 loops=5057)

Production:
->  Index Scan using lit_actor_speed on litigant_details
    (cost=0.00..5.63 rows=1 width=81)
    (actual time=3.355..3.364 rows=1 loops=5057)

Devel:
->  Index Scan using case_speed on case_data
    (cost=0.00..3.46 rows=1 width=26)
    (actual time=4.222..4.230 rows=1 loops=5052)

Production:
->  Index Scan using case_data_pkey on case_data
    (cost=0.00..5.31 rows=1 width=26)
    (actual time=1.897..1.904 rows=1 loops=5052)

Notice that the actual per-row time on production is as little as half
of what it is on your devel box.

As a test, can you do "time cat $index_file >/dev/null" a couple of
times, and then determine the MB/s?
Alternatively run vmstat in another shell. If the read/s doesn't change,
then you know the "cat" is being served from RAM, and thus it really is
cached.

I can point you to REINDEX and CLUSTER, but if it is cached in RAM, I
honestly can't say why the per-loop time would be that much slower.
Are both systems running the same postgres version? It sounds like it is
different (since you say something about switching to 8.0).
I doubt it, but you might try an 8.1devel version.

...

>>Do you regularly vacuum analyze your tables?
>>Just as a test, try running:
>>set enable_nestloop to off;
>
> not quite acceptable
> Total runtime: 221486.149 ms
>

Well, the estimates are now at least closer (3k vs 5k instead of 1), and
it is still choosing nested loops. So they probably are faster.
I would still be interested in the actual EXPLAIN ANALYZE with nested
loops disabled. It is possible that *some* of the nested loops are
performing worse than they have to.
But really, you have worse index speed, and that needs to be figured out.
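The requested experiment, spelled out (the setting lasts only for the
session; RESET puts it back):

```sql
SET enable_nestloop TO off;
EXPLAIN ANALYZE
    SELECT ...;          -- the same JONES% query as above
RESET enable_nestloop;
```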

John
=:->

From:
Ron
Date:

At 01:18 PM 8/19/2005, John A Meinel wrote:
>Jeremiah Jahn wrote:
> > Sorry about the formatting.
> >
> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
> >
> >>Jeremiah Jahn wrote:
> >>
> >>
>
>...
>
> >>The expensive parts are the 4915 lookups into the litigant_details (each
> >>one takes approx 4ms for a total of ~20s).
> >>And then you do it again on case_data (average 3ms each * 4906 loops =
> >>~15s).
> >
> > Is there some way to avoid this?
> >
>
>Well, in general, 3ms for a single lookup seems really long. Maybe your
>index is bloated by not vacuuming often enough. Do you tend to get a lot
>of updates to litigant_details?

Given that the average access time for a 15Krpm HD is in the 5.5-6ms
range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a
single lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm
case) of table accesses requires a seek.

This implies a poor match between physical layout and access pattern.

If I understand correctly, the table should not be very fragmented
given that this is a reasonably freshly loaded DB?  That implies that
the fields being looked up are not well sorted in the table compared
to the query pattern.

If the entire table could fit in RAM, this would be far less of a
consideration.  Failing that, the physical HD layout has to be
improved or the query pattern has to be changed to reduce seeks.


>There are a couple possibilities at this point. First, you can REINDEX
>the appropriate index, and see if that helps. However, if this is a test
>box, it sounds like you just did a dump and reload, which wouldn't have
>bloat in an index.
>
>Another possibility. Is this the column that you usually use when
>pulling information out of litigant_details? If so, you can CLUSTER
>litigant_details on the appropriate index. This will help things be
>close together that should be, which decreases the index lookup costs.
>
>However, if this is not the common column, then you probably will slow
>down whatever other accesses you may have on this table.
>
>After CLUSTER, the current data will stay clustered, but new data will
>not, so you have to continually CLUSTER, the same way that you might
>VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
>expensive as a VACUUM FULL. Be aware of this, but it might vastly
>improve your performance, so it would be worth it.

CLUSTER can be a very large maintenance overhead/problem if the
table(s) in question actually need to be "continually" re-CLUSTERed.

If there is no better solution available, then you do what you have
to, but it feels like there should be a better answer here.

Perhaps the DB schema needs examining to see if it matches up well
with its real usage?

Ron Peacetree



From:
John A Meinel
Date:

Ron wrote:
> At 01:18 PM 8/19/2005, John A Meinel wrote:
>
>> Jeremiah Jahn wrote:
>> > Sorry about the formatting.
>> >
>> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
>> >
>> >>Jeremiah Jahn wrote:
>> >>
>> >>
>>
>> ...
>>
>> >>The expensive parts are the 4915 lookups into the litigant_details
>> (each
>> >>one takes approx 4ms for a total of ~20s).
>> >>And then you do it again on case_data (average 3ms each * 4906 loops =
>> >>~15s).
>> >
>> > Is there some way to avoid this?
>> >
>>
>> Well, in general, 3ms for a single lookup seems really long. Maybe your
>> index is bloated by not vacuuming often enough. Do you tend to get a lot
>> of updates to litigant_details?
>
>
> Given that the average access time for a 15Krpm HD is in the 5.5-6ms
> range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single
> lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case)
> table accesses is requiring a seek.
>


Well, from what he has said, the total indexes are < 1GB and he has 6GB
of ram. So everything should fit. Not to mention he is only accessing
5000/several million rows.


> This implies a poor match between physical layout and access pattern.

This seems to be the case. But since this is not the only query, it may
be that other access patterns are more important to optimize for.

>
> If I understand correctly, the table should not be very fragmented given
> that this is a reasonably freshly loaded DB?  That implies that the
> fields being looked up are not well sorted in the table compared to the
> query pattern.
>
> If the entire table could fit in RAM, this would be far less of a
> consideration.  Failing that, the physical HD layout has to be improved
> or the query pattern has to be changed to reduce seeks.
>
>

...

>> After CLUSTER, the current data will stay clustered, but new data will
>> not, so you have to continually CLUSTER, the same way that you might
>> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
>> expensive as a VACUUM FULL. Be aware of this, but it might vastly
>> improve your performance, so it would be worth it.
>
>
> CLUSTER can be a very large maintenance overhead/problem if the table(s)
> in question actually need to be "continually" re CLUSTER ed.
>
> If there is no better solution available, then you do what you have to,
> but it feels like there should be a better answer here.
>
> Perhaps the DB schema needs examining to see if it matches up well with
> its real usage?
>
> Ron Peacetree
>

I certainly agree that CLUSTER is expensive, and is an on-going
maintenance issue. If it is the normal access pattern, though, it may be
worth it.

I also wonder, though, if his table is properly normalized. Which, as
you mentioned, might lead to improved access patterns.

John
=:->

From:
Jeremiah Jahn
Date:

On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
> Jeremiah Jahn wrote:
> > Sorry about the formatting.
> >
> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
> >
> >>Jeremiah Jahn wrote:
> >>
> >>
>
> ...
>
> >>The expensive parts are the 4915 lookups into the litigant_details (each
> >>one takes approx 4ms for a total of ~20s).
> >>And then you do it again on case_data (average 3ms each * 4906 loops =
> >>~15s).
> >
> > Is there some way to avoid this?
> >
>
> Well, in general, 3ms for a single lookup seems really long. Maybe your
> index is bloated by not vacuuming often enough. Do you tend to get a lot
> of updates to litigant_details?
I have vacuumed this already. I get lots of updates, but this data is
mostly unchanging.

>
> There are a couple possibilities at this point. First, you can REINDEX
> the appropriate index, and see if that helps. However, if this is a test
> box, it sounds like you just did a dump and reload, which wouldn't have
> bloat in an index.

I loaded it using slony

>
> Another possibility. Is this the column that you usually use when
> pulling information out of litigant_details? If so, you can CLUSTER
> litigant_details on the appropriate index. This will help things be
> close together that should be, which decreases the index lookup costs.
Clustering on this right now. Most of the other things are already
clustered: name and case_data.

>
> However, if this is not the common column, then you probably will slow
> down whatever other accesses you may have on this table.
>
> After CLUSTER, the current data will stay clustered, but new data will
> not, so you have to continually CLUSTER, the same way that you might
> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
> expensive as a VACUUM FULL. Be aware of this, but it might vastly
> improve your performance, so it would be worth it.
I generally re-cluster once a week.
>
> >
> >
> >>So there is no need for preloading your indexes on the identity table.
> >>It is definitely not the bottleneck.
> >>
> >>So a few design bits, which may help your database.
> >>Why is "actor_id" a text field instead of a number?
> >
> > This is simply due to the nature of the data.
> >
>
> I'm just wondering if changing into a number, and using a number->name
> lookup would be faster for you. It may not be. In general, I prefer to
> use numbers for references. I may be over paranoid, but I know that some
> locales are bad with string -> string comparisons. And since the data in
> your database is stored as UNICODE, I'm not sure if it has to do any
> translating or not. Again, something to consider, it may not make any
> difference.
I don't believe so. I initialize the DB with locale 'C'. I used to have
the problem where things were being initialized as en_US; this would
prevent any text-based index from being used. That doesn't seem to be
the case here, so I'm not worried about it.
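For what it's worth, the cluster's effective locale can be checked from
psql (lc_collate is what governs whether a plain btree index can serve
LIKE prefixes):

```sql
SHOW lc_collate;   -- should report 'C' for an initdb done in the C locale
SHOW lc_ctype;
```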


>
>
> >
> >>You could try creating an index on "litigant_details (actor_id,
> >>court_ori)" so that it can do just an index lookup, rather than an index
> >>+ filter.
> >
> > I have one, but it doesn't seem to like to use it. Don't really need it
> > though, I can just drop the court_id out of the query. It's redundant,
> > since each actor_id is also unique in litigant details. I had run vac
> > full and analyze but I ran them again anyway and the planning improved.
> > However, my 14 disk raid 10 array is still slower than my 3 disk raid 5
> > on my production box. 46sec vs 30sec (with live traffic on the
> > production) One of the strange things is that when I run the cat command
> > on my index and tables that are "HOT" it has no effect on memory usage.
> > Right now I'm running ext3 on LVM. I'm still in a position to redo the
> > file system and everything. Is this a good way to do it or should I
> > switch to something else? What about stripe and extent sizes...? kernel
> > parameters to change?
>
> Well, the plans are virtually identical. There is one small difference
> as to whether it joins against case_data or court first. But 'court' is
> very tiny (small enough to use a seqscan instead of index scan) I'm a
> little surprised with court being this small that it doesn't do
> something like a hash aggregation, but court takes no time anyway.
>
> The real problem is that your nested loop index time is *much* slower.
>
> Devel:
> ->  Index Scan using lit_actor_speed on litigant_details
>     (cost=0.00..3.96 rows=1 width=81)
>     (actual time=4.788..4.812 rows=1 loops=5057)
>
> Production:
> ->  Index Scan using lit_actor_speed on litigant_details
>     (cost=0.00..5.63 rows=1 width=81)
>     (actual time=3.355..3.364 rows=1 loops=5057)
>
> Devel:
> ->  Index Scan using case_speed on case_data
>     (cost=0.00..3.46 rows=1 width=26)
>     (actual time=4.222..4.230 rows=1 loops=5052)
>
> Production:
> ->  Index Scan using case_data_pkey on case_data
>     (cost=0.00..5.31 rows=1 width=26)
>     (actual time=1.897..1.904 rows=1 loops=5052)
>
> Notice that the actual per-row cost is as much as 1/2 less than on your
> devel box.
>
> As a test, can you do "time cat $index_file >/dev/null" a couple of
> times. And then determine the MB/s.
> Alternatively run vmstat in another shell. If the read/s doesn't change,
> then you know the "cat" is being served from RAM, and thus it really is
> cached.
it's cached alright. I'm getting a read rate of about 150MB/sec. I would
have thought it would be faster with my raid setup. I think I'm going to
scrap the whole thing and get rid of LVM. I'll just do a straight ext3
system. Maybe that will help. Still trying to get suggestions for a
stripe size.

>
> I can point you to REINDEX and CLUSTER, but if it is caching in ram, I
> honestly can't say why the per loop would be that much slower.
> Are both systems running the same postgres version? It sounds like it is
> different (since you say something about switching to 8.0).
These had little or no effect.
The production machine is running 7.4, while the devel machine is
running 8.0.

> I doubt it, but you might try an 8.1devel version.
>
> ...
>
> >>Do you regularly vacuum analyze your tables?
> >>Just as a test, try running:
> >>set enable_nestloop to off;
> >
> > not quite acceptable
> > Total runtime: 221486.149 ms
> >
>
> Well, the estimates are now at least closer (3k vs 5k instead of 1), and
> it is still choosing nested loops. So they probably are faster.
> I would still be interested in the actual EXPLAIN ANALYZE with nested
> loops disabled. It is possible that *some* of the nested loops are
> performing worse than they have to.

this is a cached version.

> copa=> explain analyze select
full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data
> copa-> from identity
> copa-> join litigant_details on identity.actor_id = litigant_details.actor_id
> copa-> join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori =
case_data.court_ori
> copa-> join court on identity.court_ori = court.id
> copa-> where identity.court_ori = 'IL081025J' and full_name like 'SMITH%' order by  full_name;
>                                                                                      QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=100502560.72..100502583.47 rows=9099 width=86) (actual time=17843.876..17849.401 rows=8094 loops=1)
>    Sort Key: identity.full_name
>    ->  Merge Join  (cost=100311378.72..100501962.40 rows=9099 width=86) (actual time=15195.816..17817.847 rows=8094
loops=1)
>          Merge Cond: ((("outer".court_ori)::text = "inner"."?column10?") AND (("outer".case_id)::text =
"inner"."?column11?"))
>          ->  Index Scan using case_speed on case_data  (cost=0.00..170424.73 rows=3999943 width=26) (actual
time=0.015..4540.525 rows=3018284 loops=1)
>          ->  Sort  (cost=100311378.72..100311400.82 rows=8839 width=112) (actual time=9594.985..9601.174 rows=8094
loops=1)
>                Sort Key: (litigant_details.court_ori)::text, (litigant_details.case_id)::text
>                ->  Nested Loop  (cost=100002491.43..100310799.34 rows=8839 width=112) (actual time=6892.755..9555.828
rows=8094 loops=1)
>                      ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=12) (actual time=0.085..0.096 rows=1
loops=1)
>                            Filter: ('IL081025J'::text = (id)::text)
>                      ->  Merge Join  (cost=2491.43..310707.66 rows=8839 width=113) (actual time=6892.656..9519.680
rows=8094 loops=1)
>                            Merge Cond: (("outer".actor_id)::text = "inner"."?column7?")
>                            ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..295722.00
rows=4956820 width=81) (actual time=0.027..5613.814 rows=3736703 loops=1)
>                            ->  Sort  (cost=2491.43..2513.71 rows=8913 width=82) (actual time=116.071..122.272
rows=8100 loops=1)
>                                  Sort Key: (identity.actor_id)::text
>                                  ->  Index Scan using name_speed on identity  (cost=0.00..1906.66 rows=8913 width=82)
(actual time=0.133..81.104 rows=8100 loops=1)
>                                        Index Cond: (((full_name)::text >= 'SMITH'::character varying) AND
((full_name)::text < 'SMITI'::character varying))
>                                        Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~
'SMITH%'::text))
>  Total runtime: 17859.917 ms

> But really, you have worse index speed, and that needs to be figured out.
>
> John
> =:->
--
Speak softly and carry a +6 two-handed sword.


From:
Jeremiah Jahn
Date:

On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote:
> Ron wrote:
> > At 01:18 PM 8/19/2005, John A Meinel wrote:
> >
> >> Jeremiah Jahn wrote:
> >> > Sorry about the formatting.
> >> >
> >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
> >> >
> >> >>Jeremiah Jahn wrote:
> >> >>
> >> >>
> >>
> >> ...
> >>
> >> >>The expensive parts are the 4915 lookups into the litigant_details
> >> (each
> >> >>one takes approx 4ms for a total of ~20s).
> >> >>And then you do it again on case_data (average 3ms each * 4906 loops =
> >> >>~15s).
> >> >
> >> > Is there some way to avoid this?
> >> >
> >>
> >> Well, in general, 3ms for a single lookup seems really long. Maybe your
> >> index is bloated by not vacuuming often enough. Do you tend to get a lot
> >> of updates to litigant_details?
> >
> >
> > Given that the average access time for a 15Krpm HD is in the 5.5-6ms
> > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single
> > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case)
> > table accesses is requiring a seek.
> >
I think LVM may be a problem, since it also seems to break things up on
the file system. I believe my average seek time should be around 1/7th
that of a single 15Krpm disk, since it's a 14-disk RAID 10 array, and
there is no other traffic at the moment.



>
>
> Well, from what he has said, the total indexes are < 1GB and he has 6GB
> of ram. So everything should fit. Not to mention he is only accessing
> 5000/several million rows.
I tablespaced some of the indexes: they are around 211,066,880 bytes
for the name_speed index and 149,825,330 bytes for lit_actor_speed. The
tables seem to be about a gig.

>
>
> > This implies a poor match between physical layout and access pattern.
>
> This seems to be the case. But since this is not the only query, it may
> be that other access patterns are more important to optimize for.
>
> >
> > If I understand correctly, the table should not be very fragmented given
> > that this is a reasonably freshly loaded DB?  That implies that the
> > fields being looked up are not well sorted in the table compared to the
> > query pattern.
> >
> > If the entire table could fit in RAM, this would be far less of a
> > consideration.  Failing that, the physical HD layout has to be improved
> > or the query pattern has to be changed to reduce seeks.
> >
> >
>
> ...
>
> >> After CLUSTER, the current data will stay clustered, but new data will
> >> not, so you have to continually CLUSTER, the same way that you might
> >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
> >> expensive as a VACUUM FULL. Be aware of this, but it might vastly
> >> improve your performance, so it would be worth it.
> >
> >
> > CLUSTER can be a very large maintenance overhead/problem if the table(s)
> > in question actually need to be "continually" re CLUSTER ed.
> >
> > If there is no better solution available, then you do what you have to,
> > but it feels like there should be a better answer here.
> >
> > Perhaps the DB schema needs examining to see if it matches up well with
> > its real usage?
> >
> > Ron Peacetree
> >
>
> I certainly agree that CLUSTER is expensive, and is an on-going
> maintenance issue. If it is the normal access pattern, though, it may be
> worth it.

The query I've sent you is one of the most common I get; just change the
name. I handle about 180K of them a day, mostly between 8 and 5. The
clustering has never really been a problem; like I said before, I do it
about once a week. I handle about 3000 update batches an hour, each
consisting of about 1000-3000 statements, i.e. about 2.5 million updates
per hour. In the last few months I've filtered these down to about
400K update/delete/insert statements per hour.

>
> I also wonder, though, if his table is properly normalized. Which, as
> you mentioned, might lead to improved access patterns.
The system is about as normalized as I can get it. In general the layout
is the following:
courts have cases, cases have litigant_details. Actors have identities
and litigant_details.

>
> John
> =:->
--
Speak softly and carry a +6 two-handed sword.


From:
Jeremiah Jahn
Date:

Rebuild in progress with just ext3 on the raid array... will see if this
helps the access times. If it doesn't, I'll mess with the stripe size. I
have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index
> /dev/null'; none of this seems to have helped, or even increased my
memory usage. argh! The only thing about this new system that I'm
unfamiliar with is the array setup and LVM, which is why I think that's
where the issue is. Clustering and indexing, as well as vacuuming, are
things that I do and have been aware of for some time. Perhaps slony is
a factor, but I really don't see it causing problems on index read
speed, esp. when it's not running.

thanx for your help, I really appreciate it.
-jj-




On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote:
> Ron wrote:
> > At 01:18 PM 8/19/2005, John A Meinel wrote:
> >
> >> Jeremiah Jahn wrote:
> >> > Sorry about the formatting.
> >> >
> >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
> >> >
> >> >>Jeremiah Jahn wrote:
> >> >>
> >> >>
> >>
> >> ...
> >>
> >> >>The expensive parts are the 4915 lookups into the litigant_details
> >> (each
> >> >>one takes approx 4ms for a total of ~20s).
> >> >>And then you do it again on case_data (average 3ms each * 4906 loops =
> >> >>~15s).
> >> >
> >> > Is there some way to avoid this?
> >> >
> >>
> >> Well, in general, 3ms for a single lookup seems really long. Maybe your
> >> index is bloated by not vacuuming often enough. Do you tend to get a lot
> >> of updates to litigant_details?
> >
> >
> > Given that the average access time for a 15Krpm HD is in the 5.5-6ms
> > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single
> > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case)
> > table accesses is requiring a seek.
> >
>
>
> Well, from what he has said, the total indexes are < 1GB and he has 6GB
> of ram. So everything should fit. Not to mention he is only accessing
> 5000/several million rows.
>
>
> > This implies a poor match between physical layout and access pattern.
>
> This seems to be the case. But since this is not the only query, it may
> be that other access patterns are more important to optimize for.
>
> >
> > If I understand correctly, the table should not be very fragmented given
> > that this is a reasonably freshly loaded DB?  That implies that the
> > fields being looked up are not well sorted in the table compared to the
> > query pattern.
> >
> > If the entire table could fit in RAM, this would be far less of a
> > consideration.  Failing that, the physical HD layout has to be improved
> > or the query pattern has to be changed to reduce seeks.
> >
> >
>
> ...
>
> >> After CLUSTER, the current data will stay clustered, but new data will
> >> not, so you have to continually CLUSTER, the same way that you might
> >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
> >> expensive as a VACUUM FULL. Be aware of this, but it might vastly
> >> improve your performance, so it would be worth it.
> >
> >
> > CLUSTER can be a very large maintenance overhead/problem if the table(s)
> > in question actually need to be "continually" re CLUSTER ed.
> >
> > If there is no better solution available, then you do what you have to,
> > but it feels like there should be a better answer here.
> >
> > Perhaps the DB schema needs examining to see if it matches up well with
> > its real usage?
> >
> > Ron Peacetree
> >
>
> I certainly agree that CLUSTER is expensive, and is an on-going
> maintenance issue. If it is the normal access pattern, though, it may be
> worth it.
>
> I also wonder, though, if his table is properly normalized. Which, as
> you mentioned, might lead to improved access patterns.
>
> John
> =:->
--
Speak softly and carry a +6 two-handed sword.


From:
John A Meinel
Date:

Jeremiah Jahn wrote:
> On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
>
>>Jeremiah Jahn wrote:
>>


...

>>
>>Well, in general, 3ms for a single lookup seems really long. Maybe your
>>index is bloated by not vacuuming often enough. Do you tend to get a lot
>>of updates to litigant_details?
>
> I have vacuumed this already. I get lots of updates, but this data is
> mostly unchanging.
>
>
>>There are a couple possibilities at this point. First, you can REINDEX
>>the appropriate index, and see if that helps. However, if this is a test
>>box, it sounds like you just did a dump and reload, which wouldn't have
>>bloat in an index.
>
>
> I loaded it using slony

I don't know that Slony versus pg_dump/pg_restore really matters. The
big thing is that updates wouldn't be trashing your index.
But if you are saying that you cluster once a week, your index can't be
that messed up anyway. (Unless CLUSTER messes up the non-clustered
indexes, but that would make CLUSTER much less useful, so I would guess
this is not the case.)

>
>
>>Another possibility. Is this the column that you usually use when
>>pulling information out of litigant_details? If so, you can CLUSTER
>>litigant_details on the appropriate index. This will help things be
>>close together that should be, which decreases the index lookup costs.
>
> clustering on this right now. Most of the other things are already
> clustered. name and case_data

Just as a reality check, they are clustered on the columns in question,
right? (I don't know if this column is a primary key or not, but any
index can be used for clustering).
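One way to double-check is the catalog: CLUSTER records which index it last used in pg_index.indisclustered. A hypothetical query along those lines (table names are the ones from this thread):

```sql
-- Show which index, if any, each table was last CLUSTERed on.
SELECT t.relname AS table_name, i.relname AS clustered_index
FROM pg_index x
JOIN pg_class t ON t.oid = x.indrelid
JOIN pg_class i ON i.oid = x.indexrelid
WHERE x.indisclustered
  AND t.relname IN ('litigant_details', 'identity', 'case_data');
```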

>
>
>>However, if this is not the common column, then you probably will slow
>>down whatever other accesses you may have on this table.
>>
>>After CLUSTER, the current data will stay clustered, but new data will
>>not, so you have to continually CLUSTER, the same way that you might
>>VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
>>expensive as a VACUUM FULL. Be aware of this, but it might vastly
>>improve your performance, so it would be worth it.
>
> I generally re-cluster once a week.
>
>>>
>>>>So there is no need for preloading your indexes on the identity table.
>>>>It is definitely not the bottleneck.
>>>>
>>>>So a few design bits, which may help your database.
>>>>Why is "actor_id" a text field instead of a number?
>>>
>>>This is simply due to the nature of the data.
>>>
>>
>>I'm just wondering if changing into a number, and using a number->name
>>lookup would be faster for you. It may not be. In general, I prefer to
>>use numbers for references. I may be over paranoid, but I know that some
>>locales are bad with string -> string comparisons. And since the data in
>>your database is stored as UNICODE, I'm not sure if it has to do any
>>translating or not. Again, something to consider, it may not make any
>>difference.
>
> I don't believe so. I initialize the DB as 'lang=C'. I used to have the
> problem where things were being inited as en_US. This would prevent any
> text-based index from working. This doesn't seem to be the case here, so
> I'm not worried about it.
>

Sorry, I think I was confusing you with someone else who posted SHOW ALL.

>
>
>>

...

> it's cached alright. I'm getting a read rate of about 150MB/sec. I would
> have thought it would be faster with my raid setup. I think I'm going to
> scrap the whole thing and get rid of LVM. I'll just do a straight ext3
> system. Maybe that will help. Still trying to get suggestions for a
> stripe size.
>

I don't think 150MB/s is out of the realm for a 14 drive array.
How fast is:
time dd if=/dev/zero of=testfile bs=8192 count=1000000
(That should create an 8GB file, which is too big to cache entirely.)
And then how fast is:
time dd if=testfile of=/dev/null bs=8192 count=1000000

That should give you a semi-decent way of measuring how fast the RAID
system is, since the file should be too big to cache in RAM.

>
>>I can point you to REINDEX and CLUSTER, but if it is caching in ram, I
>>honestly can't say why the per loop would be that much slower.
>>Are both systems running the same postgres version? It sounds like it is
>>different (since you say something about switching to 8.0).
>
> These had little or no effect.
> The production machine is running 7.4 while the devel machine is running
> 8.0
>

Well, my concern is that maybe some portion of the 8.0 code actually
slowed things down for you. You could try reverting to 7.4 on the devel
box, though I think playing with upgrading to 8.1 might be more worthwhile.

...

>
> this is a cached version.
>

I assume that you mean this is the second run of the query. I can't
compare it too much, since this is "smith" rather than "jones". But this
one is 17s rather than the other one being 46s.

And that includes having 8k rows instead of having 5k rows.

Have you tried other values with disabled nested loops? Because this
query (at least in cached form) seems to be *way* faster than with
nested loops.
I know that you somehow managed to get 200s in your testing, but it
might just be that whatever needed to be loaded is now loaded, and you
would get better performance.
If this is true, it means you might need to tweak some settings, and
make sure your statistics are decent, so that postgres can actually pick
the optimal plan.
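A sketch of the knobs John is talking about (the settings are standard PostgreSQL, the values are illustrative only): compare plans with nested loops disabled for the session, and raise the statistics target on a join column before re-ANALYZEing so the planner has better row estimates.

```sql
-- Per-session experiment: see what plan the optimizer picks
-- when it can't choose a nested loop.
SET enable_nestloop TO off;
-- (re-run the problem query under EXPLAIN ANALYZE here)

-- Better statistics on the join column, then refresh them.
ALTER TABLE litigant_details ALTER COLUMN actor_id SET STATISTICS 200;
ANALYZE litigant_details;
```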

>
>>copa=> explain analyze select full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data
>>copa-> from identity
>>copa-> join litigant_details on identity.actor_id = litigant_details.actor_id
>>copa-> join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori = case_data.court_ori
>>copa-> join court on identity.court_ori = court.id
>>copa-> where identity.court_ori = 'IL081025J' and full_name like 'SMITH%' order by full_name;
>>                                                                                     QUERY PLAN
>>-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> Sort  (cost=100502560.72..100502583.47 rows=9099 width=86) (actual time=17843.876..17849.401 rows=8094 loops=1)
>>   Sort Key: identity.full_name
>>   ->  Merge Join  (cost=100311378.72..100501962.40 rows=9099 width=86) (actual time=15195.816..17817.847 rows=8094 loops=1)
>>         Merge Cond: ((("outer".court_ori)::text = "inner"."?column10?") AND (("outer".case_id)::text = "inner"."?column11?"))
>>         ->  Index Scan using case_speed on case_data  (cost=0.00..170424.73 rows=3999943 width=26) (actual time=0.015..4540.525 rows=3018284 loops=1)
>>         ->  Sort  (cost=100311378.72..100311400.82 rows=8839 width=112) (actual time=9594.985..9601.174 rows=8094 loops=1)
>>               Sort Key: (litigant_details.court_ori)::text, (litigant_details.case_id)::text
>>               ->  Nested Loop  (cost=100002491.43..100310799.34 rows=8839 width=112) (actual time=6892.755..9555.828 rows=8094 loops=1)
>>                     ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=12) (actual time=0.085..0.096 rows=1 loops=1)
>>                           Filter: ('IL081025J'::text = (id)::text)

What I don't really understand is the next part. It seems to be doing an
index scan on 3.7M rows, and getting very decent performance (5s), and
then merging against a table which returns only 8k rows.
Why is it having to look through all of those rows?
I may be missing something, but this says it is able to do 600 index
lookups / millisecond. Which seems superfast. (Compared to your earlier
4ms / lookup)

Something fishy is going on here.


>>                     ->  Merge Join  (cost=2491.43..310707.66 rows=8839 width=113) (actual time=6892.656..9519.680 rows=8094 loops=1)
>>                           Merge Cond: (("outer".actor_id)::text = "inner"."?column7?")
>>                           ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..295722.00 rows=4956820 width=81) (actual time=0.027..5613.814 rows=3736703 loops=1)
>>                           ->  Sort  (cost=2491.43..2513.71 rows=8913 width=82) (actual time=116.071..122.272 rows=8100 loops=1)
>>                                 Sort Key: (identity.actor_id)::text
>>                                 ->  Index Scan using name_speed on identity  (cost=0.00..1906.66 rows=8913 width=82) (actual time=0.133..81.104 rows=8100 loops=1)
>>                                       Index Cond: (((full_name)::text >= 'SMITH'::character varying) AND ((full_name)::text < 'SMITI'::character varying))
>>                                       Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'SMITH%'::text))
>> Total runtime: 17859.917 ms
>
>
>>But really, you have worse index speed, and that needs to be figured out.
>>
>>John
>>=:->

I'm assuming your data is private (since it looks like legal stuff).
Unless maybe that makes it part of the public record.
Anyway, I'm not able to, but sometimes someone like Tom can profile
stuff to see what is going on.

I might just be messing up my ability to read the explain output. But
somehow things don't seem to be lining up with the cost of a single
index lookup.
On my crappy Celeron 450 box, an index lookup is 0.06ms once things are
cached in ram.

John
=:->



From:
John A Meinel
Date:

Jeremiah Jahn wrote:
> Rebuild in progress with just ext3 on the raid array...will see if this
> helps the access times. If it doesn't I'll mess with the stripe size. I
> have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index
>
>>/dev/null' none of this seems to have helped, or even increased my
>
> memory usage. argh! The only thing about this new system that I'm
> unfamiliar with is the array setup and LVM, which is why I think that's
> where the issue is. clustering and indexing as well as vacuum etc are
> things that I do and have been aware of for sometime. Perhaps slony is a
> factor, but I really don't see it causing problems on index read speed
> esp. when it's not running.
>
> thanx for your help, I really appreciate it.
> -jj-
>

By the way, how are you measuring memory usage? Can you give the output
of that command, just to make sure you are reading it correctly.

John
=:->


From:
Dan Harris
Date:

On Aug 19, 2005, at 3:01 PM, Jeremiah Jahn wrote:

> Rebuild in progress with just ext3 on the raid array...will see if
> this
> helps the access times.

From my recent experiences, I can say ext3 is probably not a great
choice for Pg databases.  If you check the archives you'll see
there's a lot of discussion about various journalling filesystems, and
ext3 usually (always?) comes up at the bottom as far as performance
goes.  If you insist on using it, I would at least recommend the
noatime option in fstab and using data=writeback to get the fastest of
the journal modes.
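For reference, a hypothetical fstab entry with those options (the device and mount point here are made up):

```
/dev/sda1  /var/lib/pgsql  ext3  noatime,data=writeback  0 0
```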

XFS seems to be a trusted choice, followed by Reiser and JFS both
with the occasional controversy when the comparisons pop up.

-Dan


From:
Marko Ristola
Date:

Dan Harris wrote:

> From my recent experiences, I can say ext3 is probably not a great
> choice for Pg databases.  If you check the archives you'll see
> there's a lot of discussion about various journalling filesystems and
> ext3 usually(always?) comes up on the bottom as far as performance
> goes.  If you insist on using it, I would at least recommend the
> noatime option in fstab and using data=writeback to get the faster


Based on my knowledge, ext3 is good at keeping both filesystem integrity
AND data integrity when you press the reset button.
However, by selecting data=writeback you gain more speed, but you risk
data integrity during a crash: ext3 then guarantees only filesystem
integrity.

For database transaction logs this means the last transactions are not
guaranteed to have been written to the hard drives during a hardware
reset, i.e. some committed transactions may be lost.

Reiserfs is known to behave in this unsafe way as well.
Is there a way with a Reiserfs filesystem to fulfill both
filesystem AND data integrity requirements nowadays?

See for example "man mount" for the effects of data=journal,
data=ordered (the default), and data=writeback for ext3. Only writeback
risks data integrity.

Ext3 is the only journaled filesystem I know of that fulfills these
fundamental data integrity guarantees. Personally I like such
filesystems, even though they mean less speed.

Marko Ristola




From:
Michael Stone
Date:

On Sat, Aug 20, 2005 at 01:12:15AM -0600, Dan Harris wrote:
>XFS seems to be a trusted choice, followed by Reiser and JFS both
>with the occasional controversy when the comparisons pop up.

And don't put the xlog on a journaled filesystem. There is no advantage
to doing so, and it will slow things down. (Assuming a sane separate xlog
partition configuration, sized reasonably.)

Mike Stone

From:
Michael Stone
Date:

On Sat, Aug 20, 2005 at 02:17:54PM +0300, Marko Ristola wrote:
>Based on my knowledge, Ext3 is good with keeping filesystem integrity
>AND data integrity while pressing the reset button.  However, by
>selecting data=writeback, you gain more speed, but you risk the data
>integrity during a crash: Ext3 guarantees only filesystem integrity.

That's why postgres keeps its own transaction log. Any of these
filesystems guarantee data integrity for data that's been synced to
disk, and postgres keeps track of what data has and has not been
committed so it can recover gracefully from a failure. That's why most
filesystems are designed the way they are; the application can determine
what things need better data integrity and which need better performance
on a case-by-case basis.

Mike Stone

From:
Tom Lane
Date:

Michael Stone <> writes:
> On Sat, Aug 20, 2005 at 02:17:54PM +0300, Marko Ristola wrote:
>> Based on my knowledge, Ext3 is good with keeping filesystem integrity
>> AND data integrity while pressing the reset button.  However, by
>> selecting data=writeback, you gain more speed, but you risk the data
>> integrity during a crash: Ext3 guarantees only filesystem integrity.

> That's why postgres keeps its own transaction log. Any of these
> filesystems guarantee data integrity for data that's been synced to
> disk, and postgres keeps track of what data has and has not been
> committed so it can recover gracefully from a failure.

Right.  I think the optimal setting for a Postgres data directory is
journaled metadata, non-journaled file content.  Postgres can take care
of the data integrity for itself, but it does assume that the filesystem
stays structurally sane (eg, data blocks don't get reassigned to the
wrong file), so you need a filesystem guarantee about the metadata.

WAL files are handled in a much more conservative way (created, filled
with zeroes, and fsync'd before we ever put any valuable data in 'em).
If you have WAL on its own drive then I think Mike's recommendation of
no filesystem journalling at all for that drive is probably OK.  Or
you can do same as above (journal metadata only) if you want a little
extra protection.
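A hypothetical layout following that advice (devices and paths are made up): metadata-only journaling for the data directory, and a plain non-journaled ext2 partition for a dedicated WAL drive.

```
/dev/sdb1  /var/lib/pgsql/data  ext3  noatime,data=writeback  0 0
/dev/sdc1  /var/lib/pgsql/wal   ext2  noatime                 0 0
```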

And of course all this reasoning depends on the assumption that the
drive tells the truth about write-completion.  If the drive does write
caching it had better be able to complete all its accepted writes before
dying in a power failure.  (Hence, battery-backed write cache is OK, any
other kind is evil.)

            regards, tom lane

From:
Ron
Date:

At 04:11 PM 8/19/2005, Jeremiah Jahn wrote:
>On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote:
> > Ron wrote:
> > > At 01:18 PM 8/19/2005, John A Meinel wrote:
> > >
> > >> Jeremiah Jahn wrote:
> > >> > Sorry about the formatting.
> > >> >
> > >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
> > >> >
> > >> >>Jeremiah Jahn wrote:
> > >> >>
> > >> >>
> > >>
> > >> ...
> > >>
> > >> >>The expensive parts are the 4915 lookups into the litigant_details
> > >> (each
> > >> >>one takes approx 4ms for a total of ~20s).
> > >> >>And then you do it again on case_data (average 3ms each * 4906 loops =
> > >> >>~15s).
> > >> >
> > >> > Is there some way to avoid this?
> > >> >
> > >>
> > >> Well, in general, 3ms for a single lookup seems really long. Maybe your
> > >> index is bloated by not vacuuming often enough. Do you tend to get a lot
> > >> of updates to litigant_details?
> > >
> > >
> > > Given that the average access time for a 15Krpm HD is in the 5.5-6ms
> > > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single
> > > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case)
> > > table accesses is requiring a seek.
> > >
>I think LVM may be a problem, since it also seems to break things up on
>the file system. My access time on the seek should be around 1/7th the
>15Krpm I believe since it's a 14 disk raid 10 array. And no other
>traffic at the moment.

Oops.  There's a misconception here.  RAID arrays increase
_throughput_, AKA _bandwidth_, through parallel access to HDs.  OTOH,
access time is _latency_, and that is not changed.  Access time for a
RAID set equals that of the HD with the slowest access time, AKA
highest latency, in the RAID set.

> > Well, from what he has said, the total indexes are < 1GB and he has 6GB
> > of ram. So everything should fit. Not to mention he is only accessing
> > 5000/several million rows.
>I table spaced some of the indexes and they are around 211066880 bytes
>for the name_speed index and 149825330 for the lit_actor_speed index
>tables seem to be about a gig.

Hmm.  And you think you are only using 250MB out of your 6GB of
RAM?  Something doesn't seem to add up here.  From what's been
posted, I'd expect much more RAM to be in use.


> > > This implies a poor match between physical layout and access pattern.
> >
> > This seems to be the case. But since this is not the only query, it may
> > be that other access patterns are more important to optimize for.
> >
> > >
> > > If I understand correctly, the table should not be very fragmented given
> > > that this is a reasonably freshly loaded DB?  That implies that the
> > > fields being looked up are not well sorted in the table compared to the
> > > query pattern.
> > >
> > > If the entire table could fit in RAM, this would be far less of a
> > > consideration.  Failing that, the physical HD layout has to be improved
> > > or the query pattern has to be changed to reduce seeks.
> > >
> > >
> >
> > ...
> >
> > >> After CLUSTER, the current data will stay clustered, but new data will
> > >> not, so you have to continually CLUSTER, the same way that you might
> > >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
> > >> expensive as a VACUUM FULL. Be aware of this, but it might vastly
> > >> improve your performance, so it would be worth it.
> > >
> > >
> > > CLUSTER can be a very large maintenance overhead/problem if the table(s)
> > > in question actually need to be "continually" re CLUSTER ed.
> > >
> > > If there is no better solution available, then you do what you have to,
> > > but it feels like there should be a better answer here.
> > >
> > > Perhaps the DB schema needs examining to see if it matches up well with
> > > its real usage?
> > >
> > > Ron Peacetree
> > >
> >
> > I certainly agree that CLUSTER is expensive, and is an on-going
> > maintenance issue. If it is the normal access pattern, though, it may be
> > worth it.
>
>The query I've sent you is one of the most common I get just change the
>name. I handle about 180K of them a day mostly between 8 and 5. The
>clustering has never really been a problem. Like I said before I do it
>about once a week. I handle about 3000 update an hour consisting of
>about 1000-3000 statement per update. ie about 2.5 million updates per
>hour. In the last few months or so I've filtered these down to about
>400K update/delete/insert statements per hour.

2.5M updates per hour = ~695 updates per second.  400K per hour =
~112 updates per sec.  These should be well within the capabilities
of a RAID 10 subsystem based on 14 15Krpm HDs assuming a decent RAID
card.  What is the exact HW of the RAID subsystem involved and how is
it configured?  You shouldn't be having a performance problem AFAICT...

> > I also wonder, though, if his table is properly normalized.
> Which, as you mentioned, might lead to improved access patterns.
>The system is about as normalized as I can get it. In general the
>layout is the following:
>courts have cases, cases have litigant_details. Actors have
>identities and litigant_details.

Hmmm.  Can you tell us more about the actual schema? I may have an idea...

>  >
> > John
> > =:->
>--
>Speak softly and carry a +6 two-handed sword.

Nah.  A wand of 25th level automatic Magic Missile Fire ;-)

Ron Peacetree



From:
Jeremiah Jahn
Date:

I'm just watching gnome-system-monitor. Which, after careful
consideration... and looking at dstat, means I'm on CRACK.... GSM isn't
showing cached memory usage. I assume that the cache memory usage is
where data off of the disks would be cached...?
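Right: on Linux the page cache shows up under "Buffers" and "Cached" in /proc/meminfo rather than as per-process memory, which is why a process-oriented monitor can make a busy box look nearly idle. A quick, Linux-specific way to see it:

```shell
# Buffers/Cached are the kernel's disk cache; MemFree alone
# understates how much RAM is actually doing useful work.
grep -E '^(MemTotal|MemFree|Buffers|Cached):' /proc/meminfo
```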



memory output from dstat is this for  a few seconds:

---procs--- ------memory-usage----- ---paging-- --disk/sda----disk/sdb- ----swap--- ----total-cpu-usage----
run blk new|_used _buff _cach _free|__in_ _out_|_read write:_read write|_used _free|usr sys idl wai hiq siq
  0   0   0|1336M   10M 4603M   17M| 490B  833B|3823B 3503k:1607k 4285k| 160k 2048M|  4   1  89   7   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0   464k| 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0    48k:   0     0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0   132k:   0     0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0    36k:   0     0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0    12k:   0     0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   0  75   0   0   0
  2   0   0|1353M   10M 4585M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   1  75   0   0   0
  1   0   0|1321M   10M 4616M   19M|   0     0 |   0     0 :   0     0 | 160k 2048M| 18   8  74   0   0   0
  1   0   0|1326M   10M 4614M   17M|   0     0 |   0     0 :4096B    0 | 160k 2048M| 16  10  74   1   0   0
  1   0   0|1330M   10M 4609M   17M|   0     0 |   0    12k:4096B    0 | 160k 2048M| 17   9  74   0   0   0
  0   1   0|1343M   10M 4596M   17M|   0     0 |   0     0 :   0   316M| 160k 2048M|  5  10  74  11   0   1
  0   1   0|1339M   10M 4596M   21M|   0     0 |   0     0 :   0     0 | 160k 2048M|  0   0  74  25   0   1
  0   2   0|1334M   10M 4596M   25M|   0     0 |   0  4096B:   0     0 | 160k 2048M|  0   0  54  44   0   1
  1   0   0|1326M   10M 4596M   34M|   0     0 |   0     0 :   0   364k| 160k 2048M|  4   1  60  34   0   1
  1   0   0|1290M   10M 4596M   70M|   0     0 |   0    12k:   0     0 | 160k 2048M| 24   1  75   0   0   0
  1   0   0|1301M   10M 4596M   59M|   0     0 |   0    20k:   0     0 | 160k 2048M| 21   4  75   0   0   0
  1   0   0|1312M   10M 4596M   48M|   0     0 |   0     0 :   0     0 | 160k 2048M| 22   4  75   0   0   0
  1   0   0|1323M   10M 4596M   37M|   0     0 |   0     0 :   0    24k| 160k 2048M| 21   4  75   0   0   0
  1   0   0|1334M   10M 4596M   25M|   0     0 |   0     0 :   0    56k| 160k 2048M| 21   4  75   0   0   0



On Fri, 2005-08-19 at 16:07 -0500, John A Meinel wrote:
> Jeremiah Jahn wrote:
> > Rebuild in progress with just ext3 on the raid array...will see if this
> > helps the access times. If it doesn't I'll mess with the stripe size. I
> > have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index
> >
> >>/dev/null' none of this seems to have helped, or even increased my
> >
> > memory usage. argh! The only thing about this new system that I'm
> > unfamiliar with is the array setup and LVM, which is why I think that's
> > where the issue is. clustering and indexing as well as vacuum etc are
> > things that I do and have been aware of for sometime. Perhaps slony is a
> > factor, but I really don't see it causing problems on index read speed
> > esp. when it's not running.
> >
> > thanx for your help, I really appreciate it.
> > -jj-
> >
>
> By the way, how are you measuring memory usage? Can you give the output
> of that command, just to make sure you are reading it correctly.
>
> John
> =:->
>
--
Speak softly and carry a +6 two-handed sword.


From:
Jeremiah Jahn
Date:

On Sat, 2005-08-20 at 11:59 -0400, Ron wrote:
> At 04:11 PM 8/19/2005, Jeremiah Jahn wrote:
> >On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote:
> > > Ron wrote:
> > > > At 01:18 PM 8/19/2005, John A Meinel wrote:
> > > >
> > > >> Jeremiah Jahn wrote:
> > > >> > Sorry about the formatting.
> > > >> >
> > > >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
> > > >> >
> > > >> >>Jeremiah Jahn wrote:
> > > >> >>
> > > >> >>
> > > >>
> > > >> ...
> > > >>
> > > >> >>The expensive parts are the 4915 lookups into the litigant_details
> > > >> (each
> > > >> >>one takes approx 4ms for a total of ~20s).
> > > >> >>And then you do it again on case_data (average 3ms each * 4906 loops =
> > > >> >>~15s).
> > > >> >
> > > >> > Is there some way to avoid this?
> > > >> >
> > > >>
> > > >> Well, in general, 3ms for a single lookup seems really long. Maybe your
> > > >> index is bloated by not vacuuming often enough. Do you tend to get a lot
> > > >> of updates to litigant_details?
> > > >
> > > >
> > > > Given that the average access time for a 15Krpm HD is in the 5.5-6ms
> > > > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single
> > > > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case)
> > > > table accesses is requiring a seek.
> > > >
> >I think LVM may be a problem, since it also seems to break things up on
> >the file system. My access time on the seek should be around 1/7th the
> >15Krpm I believe since it's a 14 disk raid 10 array. And no other
> >traffic at the moment.
>
> Oops.  There's a misconception here.  RAID arrays increase
> _throughput_ AKA _bandwidth_ through parallel access to HDs.  OTOH,
> access time is _latency_, and that is not changed.  Access time for a
> RAID set is equal to that of the slowest access time, AKA highest
> latency, HD in the RAID set.

So I will max out in the 5.5-6ms range for access time?


>
> > > Well, from what he has said, the total indexes are < 1GB and he has 6GB
> > > of ram. So everything should fit. Not to mention he is only accessing
> > > 5000/several million rows.
> >I table spaced some of the indexes and they are around 211066880 bytes
> >for the name_speed index and 149825330 for the lit_actor_speed index
> >tables seem to be about a gig.
>
> Hmm.  And you think you are only using 250MB out of your 6GB of
> RAM?  Something doesn't seem to add up here.  From what's been
> posted, I'd expect much more RAM to be in use.

The cached memory usage is completely using up the rest of the memory.

>
>
> > > > This implies a poor match between physical layout and access pattern.
> > >
> > > This seems to be the case. But since this is not the only query, it may
> > > be that other access patterns are more important to optimize for.
> > >
> > > >
> > > > If I understand correctly, the table should not be very fragmented given
> > > > that this is a reasonably freshly loaded DB?  That implies that the
> > > > fields being looked up are not well sorted in the table compared to the
> > > > query pattern.
> > > >
> > > > If the entire table could fit in RAM, this would be far less of a
> > > > consideration.  Failing that, the physical HD layout has to be improved
> > > > or the query pattern has to be changed to reduce seeks.
> > > >
> > > >
> > >
> > > ...
> > >
> > > >> After CLUSTER, the current data will stay clustered, but new data will
> > > >> not, so you have to continually CLUSTER, the same way that you might
> > > >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
> > > >> expensive as a VACUUM FULL. Be aware of this, but it might vastly
> > > >> improve your performance, so it would be worth it.
> > > >
> > > >
> > > > CLUSTER can be a very large maintenance overhead/problem if the table(s)
> > > > in question actually need to be "continually" re CLUSTER ed.
> > > >
> > > > If there is no better solution available, then you do what you have to,
> > > > but it feels like there should be a better answer here.
> > > >
> > > > Perhaps the DB schema needs examining to see if it matches up well with
> > > > its real usage?
> > > >
> > > > Ron Peacetree
> > > >
> > >
> > > I certainly agree that CLUSTER is expensive, and is an on-going
> > > maintenance issue. If it is the normal access pattern, though, it may be
> > > worth it.
> >
> >The query I've sent you is one of the most common I get just change the
> >name. I handle about 180K of them a day mostly between 8 and 5. The
> >clustering has never really been a problem. Like I said before I do it
> >about once a week. I handle about 3000 update an hour consisting of
> >about 1000-3000 statement per update. ie about 2.5 million updates per
> >hour. In the last few months or so I've filtered these down to about
> >400K update/delete/insert statements per hour.
>
> 2.5M updates per hour = ~695 updates per second.  400K per hour =
> ~112 updates per sec.  These should be well within the capabilities
> of a RAID 10 subsystem based on 14 15Krpm HDs assuming a decent RAID
> card.  What is the exact HW of the RAID subsystem involved and how is
> it configured?  You shouldn't be having a performance problem AFAICT...

A Dell PERC4 with 14 drives; each pair is RAID 1, with spanning enabled
across all of the pairs. It doesn't say RAID 10, but it seems to
be it. What else would you like to know?
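The per-second rates quoted above are straightforward to check (editor's arithmetic sketch, using the hourly figures from this thread; small differences from the ~695/~112 quoted are just rounding):

```python
# Converting the hourly statement volumes discussed above to per-second rates.
per_hour_peak = 2_500_000      # statements/hour before filtering
per_hour_now  = 400_000        # after filtering to net update/delete/inserts

print(round(per_hour_peak / 3600))  # ~694 updates/sec
print(round(per_hour_now / 3600))   # ~111 updates/sec
```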

>
> > > I also wonder, though, if his table is properly normalized.
> > Which, as you mentioned, might lead to improved access patterns.
> >The system is about as normalized as I can get it. In general the
> >layout is the following:
> >courts have cases, cases have litigant_details. Actors have
> >identities and litigant_details.
>
> Hmmm.  Can you tell us more about the actual schema, I may have an idea...
In what format would you like it? What kinds of things would you like to
know? I've probably missed a few things, but this is what's running on
the production box. There are no foreign keys; cascading deletes were far
too slow, and having to determine the order of deletes was a pain in the
butt.



CREATE TABLE actor (
    actor_id character varying(50) NOT NULL,
    case_id character varying(50) DEFAULT '0'::character varying NOT NULL,
    court_ori character varying(18) NOT NULL,
    role_class_code character varying(50) NOT NULL
);



CREATE TABLE identity (
    identity_id character varying(50) NOT NULL,
    actor_id character varying(50) NOT NULL,
    case_id character varying(50) DEFAULT '0'::character varying NOT NULL,
    court_ori character varying(18) NOT NULL,
    identity_type character varying(10) NOT NULL,
    entity_type character varying(50),
    full_name character varying(60) NOT NULL,
    entity_acronym character varying(50),
    name_prefix character varying(50),
    first_name character varying(50),
    middle_name character varying(50),
    last_name character varying(50),
    name_suffix character varying(50),
    gender_code character varying(50),
    date_of_birth date,
    place_of_birth character varying(50),
    height character varying(50),
    height_unit character varying(50),
    weight character varying(50),
    weight_unit character varying(50),
    religion character varying(50),
    ethnicity character varying(50),
    citizenship_country character varying(50),
    hair_color character varying(50),
    eye_color character varying(50),
    scars_marks_tatto character varying(255),
    marital_status character varying(50)
);
ALTER TABLE ONLY identity ALTER COLUMN full_name SET STATISTICS 1000;



CREATE TABLE case_data (
    case_id character varying(50) NOT NULL,
    court_ori character varying(18) NOT NULL,
    type_code character varying(50),
    subtype_code character varying(50),
    case_category character varying(50),
    case_title character varying(100),
    type_subtype_text character varying(255),
    case_year integer,
    extraction_datetime character varying(15) NOT NULL,
    update_date date NOT NULL,
    case_dom oid,
    data bytea
);



CREATE TABLE litigant_details (
    actor_id character varying(50) NOT NULL,
    case_id character varying(50) NOT NULL,
    court_ori character varying(18) NOT NULL,
    assigned_case_role character varying(50) NOT NULL,
    initial_file_date date,
    initial_close_date date,
    reopen_date date,
    reclose_date date,
    physical_file_location character varying(50),
    impound_litigant_data character varying(50),
    impound_litigant_minutes character varying(50),
    actor_type character varying(50) NOT NULL,
    conviction character varying(3)
);



CREATE TABLE actor_identifier (
    identity_id character varying(50) NOT NULL,
    actor_id character varying(50) NOT NULL,
    case_id character varying(50) DEFAULT '0'::character varying NOT NULL,
    court_ori character varying(18) NOT NULL,
    actor_identifier_type_code character varying(50) NOT NULL,
    actor_identifier_id character varying(50) NOT NULL
);



CREATE TABLE actor_relationship (
    litigant_actor_id character varying(50) NOT NULL,
    related_actor_id character varying(50) NOT NULL,
    case_id character varying(50) NOT NULL,
    court_ori character varying(18) NOT NULL,
    relationship_type character varying(50) NOT NULL
);

CREATE INDEX lit_actor_speed ON litigant_details USING btree (actor_id);

CREATE INDEX name_speed ON identity USING btree (full_name);
ALTER TABLE identity CLUSTER ON name_speed;

CREATE INDEX case_speed ON case_data USING btree (court_ori, case_id);
ALTER TABLE case_data CLUSTER ON case_speed;


ALTER TABLE ONLY actor
    ADD CONSTRAINT actor_pkey PRIMARY KEY (court_ori, case_id, actor_id);
ALTER TABLE ONLY identity
    ADD CONSTRAINT identity_pkey PRIMARY KEY (court_ori, case_id, identity_id, actor_id);
ALTER TABLE ONLY case_data
    ADD CONSTRAINT case_data_pkey PRIMARY KEY (court_ori, case_id);
ALTER TABLE ONLY litigant_details
    ADD CONSTRAINT litigant_details_pkey PRIMARY KEY (actor_id, case_id, court_ori);



>
> >  >
> > > John
> > > =:->
> >--
> >Speak softly and carry a +6 two-handed sword.
>
> Nah.  A wand of 25th level automatic Magic Missile Fire ;-)
>
> Ron Peacetree
>
--
Speak softly and carry a +6 two-handed sword.


From:
Jeremiah Jahn
Date:

On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote:
> Jeremiah Jahn wrote:
> > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
> >
> >>Jeremiah Jahn wrote:
> >>
>
>
> ...
>
> >>
> >>Well, in general, 3ms for a single lookup seems really long. Maybe your
> >>index is bloated by not vacuuming often enough. Do you tend to get a lot
> >>of updates to litigant_details?
> >
> > I have vacuumed this already. I get lots of updates, but this data is
> > mostly unchanging.
> >
> >
> >>There are a couple possibilities at this point. First, you can REINDEX
> >>the appropriate index, and see if that helps. However, if this is a test
> >>box, it sounds like you just did a dump and reload, which wouldn't have
> >>bloat in an index.
> >
> >
> > I loaded it using slony
>
> I don't know that slony versus pg_dump/pg_restore really matters. The
> big thing is that Updates wouldn't be trashing your index.
> But if you are saying that you cluster once/wk your index can't be that
> messed up anyway. (Unless CLUSTER messes up the non-clustered indexes,
> but that would make cluster much less useful, so I would have guessed
> this was not the case)
>
> >
> >
> >>Another possibility. Is this the column that you usually use when
> >>pulling information out of litigant_details? If so, you can CLUSTER
> >>litigant_details on the appropriate index. This will help things be
> >>close together that should be, which decreases the index lookup costs.
> >
> > clustering on this right now. Most of the other things are already
> > clustered. name and case_data
>
> Just as a reality check, they are clustered on the columns in question,
> right? (I don't know if this column is a primary key or not, but any
> index can be used for clustering).
>
> >
> >
> >>However, if this is not the common column, then you probably will slow
> >>down whatever other accesses you may have on this table.
> >>
> >>After CLUSTER, the current data will stay clustered, but new data will
> >>not, so you have to continually CLUSTER, the same way that you might
> >>VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
> >>expensive as a VACUUM FULL. Be aware of this, but it might vastly
> >>improve your performance, so it would be worth it.
> >
> > I generally re-cluster once a week.
> >
> >>>
> >>>>So there is no need for preloading your indexes on the identity table.
> >>>>It is definitely not the bottleneck.
> >>>>
> >>>>So a few design bits, which may help your database.
> >>>>Why is "actor_id" a text field instead of a number?
> >>>
> >>>This is simply due to the nature of the data.
> >>>
> >>
> >>I'm just wondering if changing into a number, and using a number->name
> >>lookup would be faster for you. It may not be. In general, I prefer to
> >>use numbers for references. I may be over paranoid, but I know that some
> >>locales are bad with string -> string comparisons. And since the data in
> >>your database is stored as UNICODE, I'm not sure if it has to do any
> >>translating or not. Again, something to consider, it may not make any
> >>difference.
> >
> > I don't believe so. I initialze the DB as 'lang=C'. I used to have the
> > problem where things were being inited as en_US. this would prevent any
> > text based index from working. This doesn't seem to be the case here, so
> > I'm not worried about it.
> >
>
> Sorry, I think I was confusing you with someone else who posted SHOW ALL.
>
> >
> >
> >>
>
> ...
>
> > it's cached alright. I'm getting a read rate of about 150MB/sec. I would
> > have thought is would be faster with my raid setup. I think I'm going to
> > scrap the whole thing and get rid of LVM. I'll just do a straight ext3
> > system. Maybe that will help. Still trying to get suggestions for a
> > stripe size.
> >
>
> I don't think 150MB/s is out of the realm for a 14 drive array.
> How fast is
> time dd if=/dev/zero of=testfile bs=8192 count=1000000
time dd if=/dev/zero of=testfile bs=8192 count=1000000
1000000+0 records in
1000000+0 records out

real    1m24.248s
user    0m0.381s
sys     0m33.028s


> (That should create a 8GB file, which is too big to cache everything)
> And then how fast is:
> time dd if=testfile of=/dev/null bs=8192 count=1000000

time dd if=testfile of=/dev/null bs=8192 count=1000000
1000000+0 records in
1000000+0 records out

real    0m54.139s
user    0m0.326s
sys     0m8.916s


and on a second run:

real    0m55.667s
user    0m0.341s
sys     0m9.013s


>
> That should give you a semi-decent way of measuring how fast the RAID
> system is, since it should be too big to cache in ram.

About 150MB/sec. Is there no way to make this go faster...?
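For reference, here is how the MB/s figures fall out of the dd timings above (editor's arithmetic sketch; 8192-byte blocks times 1,000,000 blocks, decimal megabytes):

```python
# Deriving throughput from the dd runs quoted above.
bytes_moved = 8192 * 1_000_000              # ~8.2GB (decimal)

write_mb_s = bytes_moved / 84.248 / 1e6     # first dd (write) wall time
read_mb_s  = bytes_moved / 54.139 / 1e6     # second dd (read) wall time

print(f"write: {write_mb_s:.0f} MB/s")      # ~97 MB/s
print(f"read:  {read_mb_s:.0f} MB/s")       # ~151 MB/s
```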

>
> >
> >>I can point you to REINDEX and CLUSTER, but if it is caching in ram, I
> >>honestly can't say why the per loop would be that much slower.
> >>Are both systems running the same postgres version? It sounds like it is
> >>different (since you say something about switching to 8.0).
> >
> > These had little or no effect.
> > The production machine is running 7.4 while the devel machine is running
> > 8.0
> >
>
> Well, my concern is that maybe some portion of the 8.0 code actually
> slowed things down for you. You could try reverting to 7.4 on the devel
> box, though I think playing with upgrading to 8.1 might be more worthwhile.
And the level of stability of 8.1? I started with 7.4, and it didn't
really feel as fast as it should have, either.

>
> ...
>
> >
> > this is a cached version.
> >
>
> I assume that you mean this is the second run of the query. I can't
> compare it too much, since this is "smith" rather than "jones". But this
> one is 17s rather than the other one being 46s.
>
> And that includes having 8k rows instead of having 5k rows.
>
> Have you tried other values with disabled nested loops? Because this
> query (at least in cached form) seems to be *way* faster than with
> nested loops.
> I know that you somehow managed to get 200s in your testing, but it
> might just be that whatever needed to be loaded is now loaded, and you
> would get better performance.
> If this is true, it means you might need to tweak some settings, and
> make sure your statistics are decent, so that postgres can actually pick
> the optimal plan.
>
> >
> >>copa=> explain analyze select full_name,identity_id,identity.case_id,court.id,date_of_birth,assigned_case_role,litigant_details.impound_litigant_data
> >>copa-> from identity
> >>copa-> join litigant_details on identity.actor_id = litigant_details.actor_id
> >>copa-> join case_data on litigant_details.case_id = case_data.case_id and litigant_details.court_ori = case_data.court_ori
> >>copa-> join court on identity.court_ori = court.id
> >>copa-> where identity.court_ori = 'IL081025J' and full_name like 'SMITH%' order by full_name;
> >>                                                                                     QUERY PLAN
> >>-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> Sort  (cost=100502560.72..100502583.47 rows=9099 width=86) (actual time=17843.876..17849.401 rows=8094 loops=1)
> >>   Sort Key: identity.full_name
> >>   ->  Merge Join  (cost=100311378.72..100501962.40 rows=9099 width=86) (actual time=15195.816..17817.847 rows=8094 loops=1)
> >>         Merge Cond: ((("outer".court_ori)::text = "inner"."?column10?") AND (("outer".case_id)::text = "inner"."?column11?"))
> >>         ->  Index Scan using case_speed on case_data  (cost=0.00..170424.73 rows=3999943 width=26) (actual time=0.015..4540.525 rows=3018284 loops=1)
> >>         ->  Sort  (cost=100311378.72..100311400.82 rows=8839 width=112) (actual time=9594.985..9601.174 rows=8094 loops=1)
> >>               Sort Key: (litigant_details.court_ori)::text, (litigant_details.case_id)::text
> >>               ->  Nested Loop  (cost=100002491.43..100310799.34 rows=8839 width=112) (actual time=6892.755..9555.828 rows=8094 loops=1)
> >>                     ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=12) (actual time=0.085..0.096 rows=1 loops=1)
> >>                           Filter: ('IL081025J'::text = (id)::text)
>
> What I don't really understand is the next part. It seems to be doing an
> index scan on 3.7M rows, and getting very decent performance (5s), and
> then merging against a table which returns only 8k rows.
> Why is it having to look through all of those rows?
> I may be missing something, but this says it is able to do 600 index
> lookups / millisecond. Which seems superfast. (Compared to your earlier
> 4ms / lookup)
>
Makes me a little confused myself...
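The "superfast" lookup rate read off the plan above checks out (editor's arithmetic sketch, using the actual-time and row counts from the EXPLAIN output in this message):

```python
# The lit_actor_speed index scan returned 3,736,703 rows in 5613.814ms.
rows = 3_736_703
ms   = 5613.814
print(round(rows / ms))   # ~666 rows/ms from the merge-join index scan

# versus the nested-loop case: ~4ms per probe is ~0.25 lookups/ms --
# the gap between sequential index traversal and random-seek probes.
```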

> Something fishy is going on here.
>
>
> >>                     ->  Merge Join  (cost=2491.43..310707.66 rows=8839 width=113) (actual time=6892.656..9519.680 rows=8094 loops=1)
> >>                           Merge Cond: (("outer".actor_id)::text = "inner"."?column7?")
> >>                           ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..295722.00 rows=4956820 width=81) (actual time=0.027..5613.814 rows=3736703 loops=1)
> >>                           ->  Sort  (cost=2491.43..2513.71 rows=8913 width=82) (actual time=116.071..122.272 rows=8100 loops=1)
> >>                                 Sort Key: (identity.actor_id)::text
> >>                                 ->  Index Scan using name_speed on identity  (cost=0.00..1906.66 rows=8913 width=82) (actual time=0.133..81.104 rows=8100 loops=1)
> >>                                       Index Cond: (((full_name)::text >= 'SMITH'::character varying) AND ((full_name)::text < 'SMITI'::character varying))
> >>                                       Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'SMITH%'::text))
> >> Total runtime: 17859.917 ms
> >
> >
> >>But really, you have worse index speed, and that needs to be figured out.
> >>
> >>John
> >>=:->
>
> I'm assuming your data is private (since it looks like legal stuff).
> Unless maybe that makes it part of the public record.
> Anyway, I'm not able to, but sometimes someone like Tom can profile
> stuff to see what is going on.
I've had Tom on here before. :) Not my devel box, but my production box,
a couple of years ago.


>
> I might just be messing up my ability to read the explain output. But
> somehow things don't seem to be lining up with the cost of a single
> index lookup.
> On my crappy Celeron 450 box, an index lookup is 0.06ms once things are
> cached in ram.
>
> John
> =:->
>
>
--
Speak softly and carry a +6 two-handed sword.


From:
Ron
Date:

At 02:53 PM 8/20/2005, Jeremiah Jahn wrote:
>On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote:
> > Jeremiah Jahn wrote:
> > > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
> > >
><snip>
> >
> > > it's cached alright. I'm getting a read rate of about 150MB/sec. I would
> > > have thought is would be faster with my raid setup. I think I'm going to
> > > scrap the whole thing and get rid of LVM. I'll just do a straight ext3
> > > system. Maybe that will help. Still trying to get suggestions for a
> > > stripe size.
> > >
> >
> > I don't think 150MB/s is out of the realm for a 14 drive array.
> > How fast is time dd if=/dev/zero of=testfile bs=8192 count=1000000
> >
>time dd if=/dev/zero of=testfile bs=8192 count=1000000
>1000000+0 records in
>1000000+0 records out
>
>real    1m24.248s
>user    0m0.381s
>sys     0m33.028s
>
>
> > (That should create a 8GB file, which is too big to cache everything)
> > And then how fast is:
> > time dd if=testfile of=/dev/null bs=8192 count=1000000
>
>time dd if=testfile of=/dev/null bs=8192 count=1000000
>1000000+0 records in
>1000000+0 records out
>
>real    0m54.139s
>user    0m0.326s
>sys     0m8.916s
>
>
>and on a second run:
>
>real    0m55.667s
>user    0m0.341s
>sys     0m9.013s
>
>
> >
> > That should give you a semi-decent way of measuring how fast the RAID
> > system is, since it should be too big to cache in ram.
>
>about 150MB/Sec. Is there no better way to make this go faster...?
Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of
them doing raw sequential IO like this should be capable of
~7*75MB/s = 525MB/s using Seagate Cheetah 15K.4's, ~7*79MB/s = 553MB/s
using Fujitsu MAU's, and ~7*86MB/s = 602MB/s using Maxtor Atlas 15K
II's, to devices external to the RAID array.

_IF_ the controller setup is high powered enough to keep that kind of
IO rate up.  This will require a controller or controllers providing
dual channel U320 bandwidth externally and quad channel U320
bandwidth internally.  IOW, it needs a controller or controllers
talking 64b 133MHz PCI-X, reasonably fast DSP/CPU units, and probably
a decent sized IO buffer as well.

AFAICT, the Dell PERC4 controllers use various flavors of the LSI
Logic MegaRAID controllers.  What I don't know is which exact one
yours is, nor do I know if it (or any of the MegaRAID controllers)
are high powered enough.

Talk to your HW supplier to make sure you have controllers adequate
to your HD's.

...and yes, your average access time will be in the 5.5ms - 6ms range
when doing a physical seek.
Even with RAID, you want to minimize seeks and maximize sequential IO
when accessing them.
Best to not go to HD at all ;-)

Hope this helps,
Ron Peacetree






From:
Ron
Date:

At 02:16 PM 8/20/2005, Jeremiah Jahn wrote:
>I'm just watching gnome-system-monitor. Which, after careful
>consideration.....and looking at dstat, means I'm on CRACK....GSM isn't
>showing cached memory usage....I assume that the cache memory usage
>is where data off of the disks would be cached...?
>
>memory output from dstat is this for a few seconds:
>
>---procs--- ------memory-usage----- ---paging-- --disk/sda----disk/sdb- ----swap--- ----total-cpu-usage----
>run blk new|_used _buff _cach _free|__in_ _out_|_read write:_read write|_used _free|usr sys idl wai hiq siq
>   0   0   0|1336M   10M 4603M   17M| 490B  833B|3823B 3503k:1607k 4285k| 160k 2048M|  4   1  89   7   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0   464k| 160k 2048M| 25   0  75   0   0   0
><snip>
>   1   0   0|1334M   10M 4596M   25M|   0     0 |   0     0 :   0    56k| 160k 2048M| 21   4  75   0   0   0

Then the "low memory usage" was a chimera.  Excellent!

Given the evidence in this thread, IMO you should upgrade your box to
16GB of RAM ASAP.  That should be enough to cache most, if not all,
of the 10GB of the "hot" part of your DB; thereby dedicating your HD
subsystem as much as possible to writes (which is unavoidable HD
IO).  As I've posted before, at $75-$150/GB, it's well worth the
investment whenever you can prove it will help as we have here.

Hope this helps,
Ron Peacetree




From:
John A Meinel
Date:

Jeremiah Jahn wrote:
> I'm just watching gnome-system-monitor. Which, after careful
> consideration.....and looking at dstat, means I'm on CRACK....GSM isn't
> showing cached memory usage....I assume that the cache memory usage is
> where data off of the disks would be cached...?
>

Well a simple "free" also tells you how much has been cached.
I believe by reading the _cach line, it looks like you have 4.6G cached.
So you are indeed using memory.
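As an aside, that cached figure can be read programmatically on Linux (editor's sketch with a hypothetical helper, assuming the standard /proc/meminfo "Cached:" field layout):

```python
# Minimal sketch: extract what "free" reports as cached from
# /proc/meminfo-style text (Linux-specific; 'cached_kb' is a
# hypothetical helper, not from this thread).
def cached_kb(meminfo_text):
    """Return the Cached: value in kB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("Cached:"):
            return int(line.split()[1])
    return 0

# With ~4.6GB cached, as in the dstat output above, you'd see roughly:
sample = "MemTotal: 6228000 kB\nCached: 4711424 kB\n"
print(cached_kb(sample) // 1024, "MB cached")  # ~4601 MB
```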

I'm still concerned why it seems to be taking 3-4ms per index lookup,
when things should already be cached in RAM.
Now, I may be wrong about whether the indexes are cached, but I sure
would expect them to be.
What is the time for a cached query on your system (with normal nested
loops)? (give the EXPLAIN ANALYZE for the *second* run, or maybe the
fourth).

I'm glad that we aren't seeing something weird with your kernel, at least.

John
=:->


>
>
> memory output from dstat is this for  a few seconds:
>
> ---procs--- ------memory-usage----- ---paging-- --disk/sda----disk/sdb- ----swap--- ----total-cpu-usage----
> run blk new|_used _buff _cach _free|__in_ _out_|_read write:_read write|_used _free|usr sys idl wai hiq siq
>   0   0   0|1336M   10M 4603M   17M| 490B  833B|3823B 3503k:1607k 4285k| 160k 2048M|  4   1  89   7   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0   464k| 160k 2048M| 25   0  75   0   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   0  75   0   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0    48k:   0     0 | 160k 2048M| 25   0  75   0   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   0  75   0   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0   132k:   0     0 | 160k 2048M| 25   0  75   0   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0    36k:   0     0 | 160k 2048M| 25   0  75   0   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   0  75   0   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0    12k:   0     0 | 160k 2048M| 25   0  75   0   0   0
>   1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   0  75   0   0   0
>   2   0   0|1353M   10M 4585M   18M|   0     0 |   0     0 :   0     0 | 160k 2048M| 25   1  75   0   0   0
>   1   0   0|1321M   10M 4616M   19M|   0     0 |   0     0 :   0     0 | 160k 2048M| 18   8  74   0   0   0
>   1   0   0|1326M   10M 4614M   17M|   0     0 |   0     0 :4096B    0 | 160k 2048M| 16  10  74   1   0   0
>   1   0   0|1330M   10M 4609M   17M|   0     0 |   0    12k:4096B    0 | 160k 2048M| 17   9  74   0   0   0
>   0   1   0|1343M   10M 4596M   17M|   0     0 |   0     0 :   0   316M| 160k 2048M|  5  10  74  11   0   1
>   0   1   0|1339M   10M 4596M   21M|   0     0 |   0     0 :   0     0 | 160k 2048M|  0   0  74  25   0   1
>   0   2   0|1334M   10M 4596M   25M|   0     0 |   0  4096B:   0     0 | 160k 2048M|  0   0  54  44   0   1
>   1   0   0|1326M   10M 4596M   34M|   0     0 |   0     0 :   0   364k| 160k 2048M|  4   1  60  34   0   1
>   1   0   0|1290M   10M 4596M   70M|   0     0 |   0    12k:   0     0 | 160k 2048M| 24   1  75   0   0   0
>   1   0   0|1301M   10M 4596M   59M|   0     0 |   0    20k:   0     0 | 160k 2048M| 21   4  75   0   0   0
>   1   0   0|1312M   10M 4596M   48M|   0     0 |   0     0 :   0     0 | 160k 2048M| 22   4  75   0   0   0
>   1   0   0|1323M   10M 4596M   37M|   0     0 |   0     0 :   0    24k| 160k 2048M| 21   4  75   0   0   0
>   1   0   0|1334M   10M 4596M   25M|   0     0 |   0     0 :   0    56k| 160k 2048M| 21   4  75   0   0   0
>
>
>
> On Fri, 2005-08-19 at 16:07 -0500, John A Meinel wrote:
>
>>Jeremiah Jahn wrote:
>>
>>>Rebuild in progress with just ext3 on the raid array...will see if this
>>>helps the access times. If it doesn't I'll mess with the stripe size. I
>>>have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index
>>>
>>>
>>>>/dev/null' none of this seems to have helped, or even increased my
>>>
>>>memory usage. argh! The only thing about this new system that I'm
>>>unfamiliar with is the array setup and LVM, which is why I think that's
>>>where the issue is. clustering and indexing as well as vacuum etc are
>>>things that I do and have been aware of for sometime. Perhaps slony is a
>>>factor, but I really don't see it causing problems on index read speed
>>>esp. when it's not running.
>>>
>>>thanx for your help, I really appreciate it.
>>>-jj-
>>>
>>
>>By the way, how are you measuring memory usage? Can you give the output
>>of that command, just to make sure you are reading it correctly.
>>
>>John
>>=:->
>>


From:
John A Meinel
Date:

Ron wrote:
> At 02:53 PM 8/20/2005, Jeremiah Jahn wrote:
>
>> On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote:
>> > Jeremiah Jahn wrote:
>> > > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
>> > >
>> <snip>
>> >
>> > > it's cached alright. I'm getting a read rate of about 150MB/sec. I
>> would
>> > > have thought is would be faster with my raid setup. I think I'm
>> going to
>> > > scrap the whole thing and get rid of LVM. I'll just do a straight
>> ext3
>> > > system. Maybe that will help. Still trying to get suggestions for a
>> > > stripe size.
>> > >

Well, since you can get a read of the RAID at 150MB/s, that means that
it is actual I/O speed. It may not be cached in RAM. Perhaps you could
try the same test, only using say 1G, which should be cached.

>> >
>> > I don't think 150MB/s is out of the realm for a 14 drive array.
>> > How fast is time dd if=/dev/zero of=testfile bs=8192 count=1000000
>> >
>> time dd if=/dev/zero of=testfile bs=8192 count=1000000
>> 1000000+0 records in
>> 1000000+0 records out
>>
>> real    1m24.248s
>> user    0m0.381s
>> sys     0m33.028s
>>
>>
>> > (That should create a 8GB file, which is too big to cache everything)
>> > And then how fast is:
>> > time dd if=testfile of=/dev/null bs=8192 count=1000000
>>
>> time dd if=testfile of=/dev/null bs=8192 count=1000000
>> 1000000+0 records in
>> 1000000+0 records out
>>
>> real    0m54.139s
>> user    0m0.326s
>> sys     0m8.916s
>>
>>
>> and on a second run:
>>
>> real    0m55.667s
>> user    0m0.341s
>> sys     0m9.013s
>>
>>
>> >
>> > That should give you a semi-decent way of measuring how fast the RAID
>> > system is, since it should be too big to cache in ram.
>>
>> about 150MB/Sec. Is there no better way to make this go faster...?

I'm actually curious about PCI bus saturation at this point. Old 32-bit
33MHz PCI could only push about 1Gbit/s (~130MB/s). Now, I'm guessing that
this is a higher performance system. But I'm really surprised that your
write speed is that close to your read speed (~100MB/s write, ~150MB/s read).
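As a sanity check, those two rates fall straight out of the dd timings above; a quick awk one-liner (mine, not from the thread) converts bytes moved and elapsed time into MB/s:

```shell
# 8192 * 1,000,000 bytes were moved in 84.248s (write) and 54.139s (read);
# divide by wall-clock time to get MB/s.
awk 'BEGIN { printf "write: %.0f MB/s\n", 8192e6 / 84.248 / 1e6 }'
awk 'BEGIN { printf "read:  %.0f MB/s\n", 8192e6 / 54.139 / 1e6 }'
# prints: write: 97 MB/s, read: 151 MB/s
```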

>
> Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them
> doing raw sequential IO like this should be capable of at
>  ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's,  ~7*79MB/s= 553MB/s
> if using Fujitsu MAU's, and ~7*86MB/s= 602MB/s if using Maxtor Atlas 15K
> II's to devices external to the RAID array.

I know I thought these were SATA drives, over 2 controllers. I could be
completely wrong, though.

>
> _IF_ the controller setup is high powered enough to keep that kind of IO
> rate up.  This will require a controller or controllers providing dual
> channel U320 bandwidth externally and quad channel U320 bandwidth
> internally.  IOW, it needs a controller or controllers talking 64b
> 133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized
> IO buffer as well.
>
> AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic
> MegaRAID controllers.  What I don't know is which exact one yours is,
> nor do I know if it (or any of the MegaRAID controllers) are high
> powered enough.
>
> Talk to your HW supplier to make sure you have controllers adequate to
> your HD's.
>
> ...and yes, your average access time will be in the 5.5ms - 6ms range
> when doing a physical seek.
> Even with RAID, you want to minimize seeks and maximize sequential IO
> when accessing them.
> Best to not go to HD at all ;-)

Well, certainly, if you can get more into RAM, you're always better off.
For writing, a battery-backed write cache, and for reading lots of
system RAM.

>
> Hope this helps,
> Ron Peacetree
>

John
=:->

From:
Marko Ristola
Date:

I'm sorry that I wrote that the option would risk log (WAL)
persistence with PostgreSQL.

I should have asked instead how you have taken this into account.

Tom Lane's email below convinces me that you have taken metadata-only
journalling into account and still preserve the durability
of committed transactions.

This means that ext3 with data=writeback is safe with PostgreSQL even with
a hardware reset button. Metadata-only journalling is faster, when it
can be used.

I didn't know that any database could keep its guarantees
with the metadata-only journalling option.

I looked at your problem.
One of the problems is that you need to keep certain data
cached in memory all the time.

That could be solved by running
SELECT COUNT(*) FROM to_be_cached;
as a cron job. It loads the whole table into the Linux kernel's page cache.
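That cache-warming trick as a crontab entry might look like this (database name and schedule are illustrative, not from the thread):

```shell
# Re-read the whole table every 10 minutes so the Linux page cache
# keeps it resident ("mydb" is a placeholder database name).
*/10 * * * * psql -d mydb -Atc 'SELECT count(*) FROM to_be_cached;' >/dev/null 2>&1
```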


Marko Ristola

Tom Lane wrote:

>Right.  I think the optimal setting for a Postgres data directory is
>journaled metadata, non-journaled file content.  Postgres can take care
>of the data integrity for itself, but it does assume that the filesystem
>stays structurally sane (eg, data blocks don't get reassigned to the
>wrong file), so you need a filesystem guarantee about the metadata.
>
>WAL files are handled in a much more conservative way (created, filled
>with zeroes, and fsync'd before we ever put any valuable data in 'em).
>If you have WAL on its own drive then I think Mike's recommendation of
>no filesystem journalling at all for that drive is probably OK.  Or
>you can do same as above (journal metadata only) if you want a little
>extra protection.
>
>And of course all this reasoning depends on the assumption that the
>drive tells the truth about write-completion.  If the drive does write
>caching it had better be able to complete all its accepted writes before
>dying in a power failure.  (Hence, battery-backed write cache is OK, any
>other kind is evil.)
>
>            regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: Don't 'kill -9' the postmaster
>
>


From:
Jeremiah Jahn
Date:

On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote:
> Ron wrote:
> > At 02:53 PM 8/20/2005, Jeremiah Jahn wrote:
> >
> >> On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote:
> >> > Jeremiah Jahn wrote:
> >> > > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
> >> > >
> >> <snip>
> >> >
> >> > > it's cached alright. I'm getting a read rate of about 150MB/sec. I
> >> would
> >> > > have thought is would be faster with my raid setup. I think I'm
> >> going to
> >> > > scrap the whole thing and get rid of LVM. I'll just do a straight
> >> ext3
> >> > > system. Maybe that will help. Still trying to get suggestions for a
> >> > > stripe size.
> >> > >
>
> Well, since you can get a read of the RAID at 150MB/s, that means that
> it is actual I/O speed. It may not be cached in RAM. Perhaps you could
> try the same test, only using say 1G, which should be cached.

[root@io pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m8.885s
user    0m0.299s
sys     0m6.998s
[root@io pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m1.654s
user    0m0.232s
sys     0m1.415s


>
> >> >
> >> > I don't think 150MB/s is out of the realm for a 14 drive array.
> >> > How fast is time dd if=/dev/zero of=testfile bs=8192 count=1000000
> >> >
> >> time dd if=/dev/zero of=testfile bs=8192 count=1000000
> >> 1000000+0 records in
> >> 1000000+0 records out
> >>
> >> real    1m24.248s
> >> user    0m0.381s
> >> sys     0m33.028s
> >>
> >>
> >> > (That should create a 8GB file, which is too big to cache everything)
> >> > And then how fast is:
> >> > time dd if=testfile of=/dev/null bs=8192 count=1000000
> >>
> >> time dd if=testfile of=/dev/null bs=8192 count=1000000
> >> 1000000+0 records in
> >> 1000000+0 records out
> >>
> >> real    0m54.139s
> >> user    0m0.326s
> >> sys     0m8.916s
> >>
> >>
> >> and on a second run:
> >>
> >> real    0m55.667s
> >> user    0m0.341s
> >> sys     0m9.013s
> >>
> >>
> >> >
> >> > That should give you a semi-decent way of measuring how fast the RAID
> >> > system is, since it should be too big to cache in ram.
> >>
> >> about 150MB/Sec. Is there no better way to make this go faster...?
>
> I'm actually curious about PCI bus saturation at this point. Old 32-bit
> 33MHz pci could only push 1Gbit = 100MB/s. Now, I'm guessing that this
> is a higher performance system. But I'm really surprised that your write
> speed is that close to your read speed. (100MB/s write, 150MB/s read).

The RAID array I have is currently set up to use a single channel. But I
have dual controllers in the array, and dual external slots on the card.
The machine is brand new and has a PCI-E backplane.



>
> >
> > Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them
> > doing raw sequential IO like this should be capable of at
> >  ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's,  ~7*79MB/s= 553MB/s
BTW I'm using Seagate Cheetah 15K.4's

> > if using Fujitsu MAU's, and ~7*86MB/s= 602MB/s if using Maxtor Atlas 15K
> > II's to devices external to the RAID array.
>
> I know I thought these were SATA drives, over 2 controllers. I could be
> completely wrong, though.
>
> >
> > _IF_ the controller setup is high powered enough to keep that kind of IO
> > rate up.  This will require a controller or controllers providing dual
> > channel U320 bandwidth externally and quad channel U320 bandwidth
> > internally.  IOW, it needs a controller or controllers talking 64b
> > 133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized
> > IO buffer as well.
> >
> > AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic
> > MegaRAID controllers.  What I don't know is which exact one yours is,
> > nor do I know if it (or any of the MegaRAID controllers) are high
> > powered enough.

PERC4eDC-PCI Express, 128MB Cache, 2-External Channels

> >
> > Talk to your HW supplier to make sure you have controllers adequate to
> > your HD's.
> >
> > ...and yes, your average access time will be in the 5.5ms - 6ms range
> > when doing a physical seek.
> > Even with RAID, you want to minimize seeks and maximize sequential IO
> > when accessing them.
> > Best to not go to HD at all ;-)
>
> Well, certainly, if you can get more into RAM, you're always better off.
> For writing, a battery-backed write cache, and for reading lots of
> system RAM.

I'm not really worried about the writing; it's the reading
that needs to be faster.

>
> >
> > Hope this helps,
> > Ron Peacetree
> >
>
> John
> =:->
--
Speak softly and carry a +6 two-handed sword.


From:
John A Meinel
Date:

Jeremiah Jahn wrote:
> On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote:
>
>>Ron wrote:
>>
>>>At 02:53 PM 8/20/2005, Jeremiah Jahn wrote:
>>
>>Well, since you can get a read of the RAID at 150MB/s, that means that
>>it is actual I/O speed. It may not be cached in RAM. Perhaps you could
>>try the same test, only using say 1G, which should be cached.
>
>
> [root@io pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000
> 1000000+0 records in
> 1000000+0 records out
>
> real    0m8.885s
> user    0m0.299s
> sys     0m6.998s
> [root@io pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000
> 1000000+0 records in
> 1000000+0 records out
>
> real    0m1.654s
> user    0m0.232s
> sys     0m1.415s
>

The write time seems about the same (but you only have 128MB of write
cache), but your read jumped up to 620MB/s. So your drives do seem to be
giving you 150MB/s.

>

...

>>I'm actually curious about PCI bus saturation at this point. Old 32-bit
>>33MHz pci could only push 1Gbit = 100MB/s. Now, I'm guessing that this
>>is a higher performance system. But I'm really surprised that your write
>>speed is that close to your read speed. (100MB/s write, 150MB/s read).
>
>
> The raid array I have is currently set up to use a single channel. But I
> have dual controllers In the array. And dual external slots on the card.
> The machine is brand new and has pci-e backplane.
>
>
>
>
>>>Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them
>>>doing raw sequential IO like this should be capable of at
>>> ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's,  ~7*79MB/s= 553MB/s
>
> BTW I'm using Seagate Cheetah 15K.4's
>

Now, are the numbers that Ron is quoting in megabytes or megabits? I'm
guessing he knows what he is talking about, and is doing megabytes.
80MB/s sustained seems rather high for a hard-disk.

Though this page:
http://www.storagereview.com/articles/200411/20041116ST3146754LW_2.html

Does seem to agree with that statement. (Between 56 and 93MB/s)

And since U320 is a 320MB/s bus, it doesn't seem like anything there
should be saturating. So why the low performance????

>>
>>>_IF_ the controller setup is high powered enough to keep that kind of IO
>>>rate up.  This will require a controller or controllers providing dual
>>>channel U320 bandwidth externally and quad channel U320 bandwidth
>>>internally.  IOW, it needs a controller or controllers talking 64b
>>>133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized
>>>IO buffer as well.
>>>
>>>AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic
>>>MegaRAID controllers.  What I don't know is which exact one yours is,
>>>nor do I know if it (or any of the MegaRAID controllers) are high
>>>powered enough.
>
>
> PERC4eDC-PCI Express, 128MB Cache, 2-External Channels

Do you know which card it is? Does it look like this one:
http://www.lsilogic.com/products/megaraid/megaraid_320_2e.html

Judging by the 320 speed, and 2 external controllers, that is my guess.
They at least claim a theoretical max of 2GB/s.

Which makes you wonder why reading from RAM is only able to get
throughput of 600MB/s. Did you run it multiple times? On my Windows
system, I get just under 550MB/s for what should be cached; copying from
/dev/zero to /dev/null I get 2.4GB/s (though that might be a no-op).

On a similar linux machine, I'm able to get 1200MB/s for a cached file.
(And 3GB/s for a zero=>null copy).

John
=:->

>
>
>>>Talk to your HW supplier to make sure you have controllers adequate to
>>>your HD's.
>>>
>>>...and yes, your average access time will be in the 5.5ms - 6ms range
>>>when doing a physical seek.
>>>Even with RAID, you want to minimize seeks and maximize sequential IO
>>>when accessing them.
>>>Best to not go to HD at all ;-)
>>
>>Well, certainly, if you can get more into RAM, you're always better off.
>>For writing, a battery-backed write cache, and for reading lots of
>>system RAM.
>
>
> I'm not really worried about the writing, it's the reading the reading
> that needs to be faster.
>
>
>>>Hope this helps,
>>>Ron Peacetree
>>>
>>
>>John
>>=:->


From:
Ron
Date:

I'm resending this as it appears not to have made it to the list.

At 10:54 AM 8/21/2005, Jeremiah Jahn wrote:
>On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote:
> > Ron wrote:
> >
> > Well, since you can get a read of the RAID at 150MB/s, that means that
> > it is actual I/O speed. It may not be cached in RAM. Perhaps you could
> > try the same test, only using say 1G, which should be cached.
>
>[root@io pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000
>1000000+0 records in
>1000000+0 records out
>
>real    0m8.885s
>user    0m0.299s
>sys     0m6.998s

This is abysmally slow.


>[root@io pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000
>1000000+0 records in
>1000000+0 records out
>
>real    0m1.654s
>user    0m0.232s
>sys     0m1.415s

This transfer rate is the only one out of the 4 you have posted that
is in the vicinity of where it should be.


>The raid array I have is currently set up to use a single channel. But I
>have dual controllers in the array. And dual external slots on the card.
>The machine is brand new and has pci-e backplane.
>
So you have 2 controllers each with 2 external slots?  But you are
currently only using 1 controller and only one external slot on that
controller?


> > > Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them
> > > doing raw sequential IO like this should be capable of at
> > >  ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's
>BTW I'm using Seagate Cheetah 15K.4's

OK, now we have that nailed down.


> > > AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic
> > > MegaRAID controllers.  What I don't know is which exact one yours is,
> > > nor do I know if it (or any of the MegaRAID controllers) are high
> > > powered enough.
>
>PERC4eDC-PCI Express, 128MB Cache, 2-External Channels

Looks like they are using the LSI Logic MegaRAID SCSI 320-2E
controller.  IIUC, you have 2 of these, each with 2 external channels?

The specs on these appear a bit strange.  They are listed as being a
PCI-Ex8 card, which means they should have a max bandwidth of 20Gb/s=
2GB/s, yet they are also listed as only supporting dual channel U320=
640MB/s when they could easily support quad channel U320=
1.28GB/s.  Why bother building a PCI-Ex8 card when only a PCI-Ex4
card (which is a more standard physical format) would've been
enough?  Or if you are going to build a PCI-Ex8 card, why not support
quad channel U320?  This smells like there's a problem with LSI's design.

The 128MB buffer also looks suspiciously small, and I do not see any
upgrade path for it on LSI Logic's site.  "Serious" RAID controllers
from companies like Xyratex, Engino, and Dot-hill can have up to
1-2GB of buffer, and there's sound technical reasons for it.  See if
there's a buffer upgrade available or if you can get controllers that
have larger buffer capabilities.

Regardless of the above, each of these controllers should still be
good for about 80-85% of 640MB/s, or ~510-540 MB/s apiece when doing
raw sequential IO if you plug 3-4 fast enough HD's into each SCSI
channel.  Cheetah 15K.4's certainly are fast enough.  Optimal setup
is probably to split each RAID 1 pair so that one HD is on each of
the SCSI channels, and then RAID 0 those pairs.  That will also
protect you from losing the entire disk subsystem if one of the SCSI
channels dies.

That 128MB of buffer cache may very well be too small to keep the IO
rate up, and/or there may be a more subtle problem with the LSI card,
and/or you may have a configuration problem, but _something(s)_ need
fixing since you are only getting raw sequential IO of ~100-150MB/s
when it should be above 500MB/s.

This will make the most difference for initial reads (first time you
load a table, first time you make a given query, etc) and for any writes.

Your HW provider should be able to help you, even if some of the HW
in question needs to be changed.  You paid for a solution.  As long
as this stuff is performing at so much less than what it is supposed
to, you have not received the solution you paid for.

BTW, on the subject of RAID stripes: IME the sweet spot tends to be in
the 64KB to 256KB range (very large, very read-heavy data mines can
want larger RAID stripes).  Only experimentation will tell you what
results in the best performance for your application.


>I'm not really worried about the writing, it's the reading the reading
>that needs to be faster.

Initial reads are only going to be as fast as your HD subsystem, so
there's a reason for making the HD subsystem faster even if all you
care about is reads.  In addition, I'll repeat my previous advice
that upgrading to 16GB of RAM would be well worth it for you.

Hope this helps,
Ron Peacetree





From:
William Yu
Date:

Ron wrote:
>> PERC4eDC-PCI Express, 128MB Cache, 2-External Channels
>
> Looks like they are using the LSI Logic MegaRAID SCSI 320-2E
> controller.  IIUC, you have 2 of these, each with 2 external channels?

A lot of people have mentioned that Dell's versions of the LSI cards can
be WAY slower than the ones you buy from LSI. Why this is the case,
nobody knows for sure.

Here's a guess on my part. A while back, I was doing some googling. And
instead of typing "LSI MegaRAID xxx", I just typed "MegaRAID xxx". Going
beyond the initial pages, I saw Tekram -- a company that supposedly
produces their own controllers -- listing products with the exact model
numbers and photos as cards from LSI and Areca. Seemed puzzling until I
read a review about SATA RAID cards that mentioned Tekram produces the
Areca cards under their own name but uses slower components to avoid
competing with Areca at the high end.

So what may be happening is that while the logic circuitry on the Dell
PERCs is the same as on the source LSI cards, the RAID
processor/RAM/internal buffers/etc. are not as fast, so Dell can shave
off a few bucks on every server. That would mean that while a true LSI
card has the processing power to do the RAID calculations for X drives,
the Dell version can probably only handle X*0.6 drives or so.


> The 128MB buffer also looks suspiciously small, and I do not see any
> upgrade path for it on LSI Logic's site.  "Serious" RAID controllers
> from companies like Xyratex, Engino, and Dot-hill can have up to 1-2GB

The card is upgradable. If you look at the pic of the card, it shows an
SDRAM DIMM rather than integrated RAM chips. I've also read reviews a
while back comparing benchmarks of the 320-2 w/ 128MB versus 512MB
onboard RAM. Their product literature is just nebulous on the RAM
upgrade part. I'm sure if you opened up the PDF manuals, you could find
the exact info.

> That 128MB of buffer cache may very well be too small to keep the IO
> rate up, and/or there may be a more subtle problem with the LSI card,
> and/or you may have a configuration problem, but _something(s)_ need
> fixing since you are only getting raw sequential IO of ~100-150MB/s when
> it should be above 500MB/s.

I think it just might be the Dell hardware, or the lack of a 64-bit
IOMMU on Xeons. Here are my numbers for a 320-1 w/ 128MB paired with
Opterons, compared to Jeremiah's.

 >> # time dd if=/dev/zero of=testfile bs=1024 count=1000000
 >> 1000000+0 records in
 >> 1000000+0 records out
 >>
 >> real    0m8.885s
 >> user    0m0.299s
 >> sys     0m6.998s

2x15K RAID1
real    0m14.493s
user    0m0.255s
sys     0m11.712s

6x15K RAID10 (2x 320-1)
real    0m9.986s
user    0m0.200s
sys     0m8.634s


 >> # time dd of=/dev/null if=testfile bs=1024 count=1000000
 >> 1000000+0 records in
 >> 1000000+0 records out
 >>
 >> real    0m1.654s
 >> user    0m0.232s
 >> sys     0m1.415s

2x15K RAID1
real    0m3.383s
user    0m0.176s
sys     0m3.207s

6x15K RAID10 (2x 320-1)
real    0m2.427s
user    0m0.178s
sys     0m2.250s

If all 14 HDs are arranged in a RAID10 array, I'd say there's definitely
something wrong with Jeremiah's hardware.



From:
Jeremiah Jahn
Date:

On Sun, 2005-08-21 at 16:13 -0400, Ron wrote:
> At 10:54 AM 8/21/2005, Jeremiah Jahn wrote:
> >On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote:
> > > Ron wrote:
> > >
> > > Well, since you can get a read of the RAID at 150MB/s, that means that
> > > it is actual I/O speed. It may not be cached in RAM. Perhaps you could
> > > try the same test, only using say 1G, which should be cached.
> >
> >[root@io pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000
> >1000000+0 records in
> >1000000+0 records out
> >
> >real    0m8.885s
> >user    0m0.299s
> >sys     0m6.998s
>
> This is abysmally slow.
>
>
> >[root@io pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000
> >1000000+0 records in
> >1000000+0 records out
> >
> >real    0m1.654s
> >user    0m0.232s
> >sys     0m1.415s
>
> This transfer rate is the only one out of the 4 you have posted that
> is in the vicinity of where it should be.
>
>
> >The raid array I have is currently set up to use a single channel. But I
> >have dual controllers in the array. And dual external slots on the card.
> >The machine is brand new and has pci-e backplane.
> >
> So you have 2 controllers each with 2 external slots?  But you are
> currently only using 1 controller and only one external slot on that
> controller?

Sorry, no. I have one dual channel card in the system and two
controllers on the array. Dell PowerVault 220S w/ PERC4eDC-PCI Express

>
>
> > > > Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them
> > > > doing raw sequential IO like this should be capable of at
> > > >  ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's
> >BTW I'm using Seagate Cheetah 15K.4's
>
> OK, now we have that nailed down.
>
>
> > > > AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic
> > > > MegaRAID controllers.  What I don't know is which exact one yours is,
> > > > nor do I know if it (or any of the MegaRAID controllers) are high
> > > > powered enough.
> >
> >PERC4eDC-PCI Express, 128MB Cache, 2-External Channels
>
> Looks like they are using the LSI Logic MegaRAID SCSI 320-2E
> controller.  IIUC, you have 2 of these, each with 2 external channels?
>
> The specs on these appear a bit strange.  They are listed as being a
> PCI-Ex8 card, which means they should have a max bandwidth of 20Gb/s=
> 2GB/s, yet they are also listed as only supporting dual channel U320=
> 640MB/s when they could easily support quad channel U320=
> 1.28GB/s.  Why bother building a PCI-Ex8 card when only a PCI-Ex4
> card (which is a more standard physical format) would've been
> enough?  Or if you are going to build a PCI-Ex8 card, why not support
> quad channel U320?  This smells like there's a problem with LSI's design.
>
> The 128MB buffer also looks suspiciously small, and I do not see any
> upgrade path for it on LSI Logic's site.  "Serious" RAID controllers
> from companies like Xyratex, Engino, and Dot-hill can have up to
> 1-2GB of buffer, and there's sound technical reasons for it.  See if
> there's a buffer upgrade available or if you can get controllers that
> have larger buffer capabilities.
>
> Regardless of the above, each of these controllers should still be
> good for about 80-85% of 640MB/s, or ~510-540 MB/s apiece when doing
> raw sequential IO if you plug 3-4 fast enough HD's into each SCSI
> channel.  Cheetah 15K.4's certainly are fast enough.  Optimal setup
> is probably to split each RAID 1 pair so that one HD is on each of
> the SCSI channels, and then RAID 0 those pairs.  That will also
> protect you from losing the entire disk subsystem if one of the SCSI
> channels dies.
I like this idea, but how exactly does one bond the two channels
together? Won't this give me both a /dev/sdb and a /dev/sdc?


>
> That 128MB of buffer cache may very well be too small to keep the IO
> rate up, and/or there may be a more subtle problem with the LSI card,
> and/or you may have a configuration problem, but _something(s)_ need
> fixing since you are only getting raw sequential IO of ~100-150MB/s
> when it should be above 500MB/s.

It looks like there's a way to add more memory to it.

>
> This will make the most difference for initial reads (first time you
> load a table, first time you make a given query, etc) and for any writes.
>
> Your HW provider should be able to help you, even if some of the HW
> in question needs to be changed.  You paid for a solution.  As long
> as this stuff is performing at so much less then what it is supposed
> to, you have not received the solution you paid for.
>
> BTW, on the subject of RAID stripes IME the sweet spot tends to be in
> the 64KB to 256KB range (very large, very read heavy data mines can
> want larger RAID stripes.).  Only experimentation will tell you what
> results in the best performance for your application.
I think I have them very small at the moment.

>
>
> >I'm not really worried about the writing, it's the reading the reading
> >that needs to be faster.
>
> Initial reads are only going to be as fast as your HD subsystem, so
> there's a reason for making the HD subsystem faster even if all you
> care about is reads.  In addition, I'll repeat my previous advice
> that upgrading to 16GB of RAM would be well worth it for you.

12GB is my max. I may run with it for a while and see.

>
> Hope this helps,
> Ron Peacetree
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
--
Speak softly and carry a +6 two-handed sword.


From:
John A Meinel
Date:

Jeremiah Jahn wrote:
> On Sun, 2005-08-21 at 16:13 -0400, Ron wrote:
>
>>At 10:54 AM 8/21/2005, Jeremiah Jahn wrote:
>>

...

>>So you have 2 controllers each with 2 external slots?  But you are
>>currently only using 1 controller and only one external slot on that
>>controller?
>
>
> Sorry, no. I have one dual channel card in the system and two
> controllers on the array. Dell PowerVault 220S w/ PERC4eDC-PCI Express
>
>

...

>>Regardless of the above, each of these controllers should still be
>>good for about 80-85% of 640MB/s, or ~510-540 MB/s apiece when doing
>>raw sequential IO if you plug 3-4 fast enough HD's into each SCSI
>>channel.  Cheetah 15K.4's certainly are fast enough.  Optimal setup
>>is probably to split each RAID 1 pair so that one HD is on each of
>>the SCSI channels, and then RAID 0 those pairs.  That will also
>>protect you from losing the entire disk subsystem if one of the SCSI
>>channels dies.
>
> I like this idea, but how exactly does one bond the two channels
> together? Won't this leave me with both a /dev/sdb and a /dev/sdc?
>

Well, even if you did, you could always use either software RAID or LVM
to turn them into a single volume.

It also depends on what the controller card BIOS will let you get away
with. Some cards will let you set up four RAID 1s (one drive from each
channel) and then create a RAID 0 of those pairs. Software RAID should
do this without any problem, and it can even be set up so that the
array can be grown in the future, as well as span multiple cards
(though some hardware cards support the latter too).
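The layered setup above can be sketched with Linux software RAID
(mdadm). This is only an illustration of the layout, not a tested
recipe: the device names are assumptions (sd[b-e] on one SCSI channel,
sd[f-i] on the other), and the chunk size is just a starting point from
the stripe-size discussion earlier in the thread.

```shell
# Mirror each pair across the two SCSI channels, so losing a channel
# only degrades the mirrors instead of taking the whole array down:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdf
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdg
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sdh
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdi

# Stripe across the four mirrors (RAID 10 overall), 256KB chunk:
mdadm --create /dev/md4 --level=0 --chunk=256 --raid-devices=4 \
      /dev/md0 /dev/md1 /dev/md2 /dev/md3
```

The resulting /dev/md4 is the single volume you'd put the database on
(or hand to LVM if you want to resize it later).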

>
>
>>That 128MB of buffer cache may very well be too small to keep the IO
>>rate up, and/or there may be a more subtle problem with the LSI card,
>>and/or you may have a configuration problem, but _something(s)_ need
>>fixing since you are only getting raw sequential IO of ~100-150MB/s
>>when it should be above 500MB/s.
>
>
> It looks like there's a way to add more memory to it.

This memory probably helps more with writing than reading. If you are
reading the same area over and over, it might act as a little extra
cache for that, but that data should already be cached in system RAM,
so you don't really gain anything.

...

>>Initial reads are only going to be as fast as your HD subsystem, so
>>there's a reason for making the HD subsystem faster even if all you
>>care about is reads.  In addition, I'll repeat my previous advice
>>that upgrading to 16GB of RAM would be well worth it for you.
>
>
> 12GB is my max. I may run with it for a while and see.

If your working set truly is 10GB, then you can get a massive
performance increase even at 12GB. If your working set is 10GB and you
have 6GB of RAM, the cache is probably always evicting what it just
read to make room for the new stuff, even though you will read that
same data again a few seconds later. So rather than just paying for the
4GB that can't be cached, you pay for the whole 10.
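A quick back-of-the-envelope for that point: with an LRU cache and a
cyclic scan, a cache smaller than the working set gets close to zero
hits, because each page is evicted just before it is needed again. The
figures here are hypothetical (10GB working set, ~100MB/s sequential
disk rate, as in the numbers reported earlier in the thread).

```shell
working_set_mb=$((10 * 1024))   # assumed 10GB working set
disk_rate_mb_s=100              # assumed ~100MB/s raw sequential IO

for cache_mb in $((6 * 1024)) $((12 * 1024)); do
  if [ "$cache_mb" -ge "$working_set_mb" ]; then
    disk_mb=0                   # everything stays cached after the first pass
  else
    disk_mb=$working_set_mb     # LRU evicts each page just before reuse
  fi
  echo "cache=${cache_mb}MB -> ${disk_mb}MB from disk, ~$((disk_mb / disk_rate_mb_s))s per pass"
done
# prints roughly:
#   cache=6144MB -> 10240MB from disk, ~102s per pass
#   cache=12288MB -> 0MB from disk, ~0s per pass
```

That is the jump from "pay for the whole 10GB every pass" to "pay
almost nothing once it's warm".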

John
=:->

>
>
>>Hope this helps,
>>Ron Peacetree
>>