Thread: Random Page Cost and Planner

Random Page Cost and Planner

From

David Jarvis

Date:

25 May 2010, 02:54:23

Hi,

I wrote a query (see below) that extracts climate data from weather stations within a given radius of a city using the dates for which those weather stations actually have data. The query uses the measurement table's only index:

CREATE UNIQUE INDEX measurement_001_stc_idx
ON climate.measurement_001
USING btree
(station_id, taken, category_id);

The value for random_page_cost was at 2.0; reducing it to 1.1 had a massive performance improvement (nearly an order of magnitude). While the results now return in 5 seconds (down from ~85 seconds), problematic lines remain. Bumping the query's end date by a single year causes a full table scan:

    sc.taken_start >= '1900-01-01'::date AND
    sc.taken_end <= '1997-12-31'::date AND

How do I persuade PostgreSQL to use the indexes, regardless of number of years between the two dates? (A full table scan against 43 million rows is probably not the best plan.) Find the EXPLAIN ANALYSE results below the query.

Thanks again!

Dave

Query
SELECT
    extract(YEAR FROM m.taken) AS year,
    avg(m.amount) as amount
FROM
    climate.city c,
    climate.station s,
    climate.station_category sc,
    climate.measurement m
WHERE
    c.id = 5182 AND
    earth_distance(
      ll_to_earth(c.latitude_decimal,c.longitude_decimal),
      ll_to_earth(s.latitude_decimal,s.longitude_decimal)) / 1000 <= 30 AND
    s.elevation BETWEEN 0 AND 3000 AND
    s.applicable = TRUE AND
    sc.station_id = s.id AND
    sc.category_id = 1 AND
    sc.taken_start >= '1900-01-01'::date AND
    sc.taken_end <= '1996-12-31'::date AND
    m.station_id = s.id AND
    m.taken BETWEEN sc.taken_start AND sc.taken_end AND
    m.category_id = sc.category_id
GROUP BY
    extract(YEAR FROM m.taken)
ORDER BY
    extract(YEAR FROM m.taken)

1900 to 1996: Index
"Sort (cost=1348597.71..1348598.21 rows=200 width=12) (actual time=2268.929..2268.935 rows=92 loops=1)"
" Sort Key: (date_part('year'::text, (m.taken)::timestamp without time zone))"
" Sort Method: quicksort Memory: 32kB"
" -> HashAggregate (cost=1348586.56..1348590.06 rows=200 width=12) (actual time=2268.829..2268.886 rows=92 loops=1)"
"        -> Nested Loop (cost=0.00..1344864.01 rows=744510 width=12) (actual time=0.807..2084.206 rows=134893 loops=1)"
"              Join Filter: ((m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND (sc.station_id = m.station_id))"
"              -> Nested Loop (cost=0.00..12755.07 rows=1220 width=18) (actual time=0.502..521.937 rows=23 loops=1)"
"                    Join Filter: ((sec_to_gc(cube_distance((ll_to_earth((c.latitude_decimal)::double precision, (c.longitude_decimal)::double precision))::cube, (ll_to_earth((s.latitude_decimal)::double precision, (s.longitude_decimal)::double precision))::cube)) / 1000::double precision) <= 30::double precision)"
"                    -> Index Scan using city_pkey1 on city c (cost=0.00..2.47 rows=1 width=16) (actual time=0.014..0.015 rows=1 loops=1)"
"                          Index Cond: (id = 5182)"
"                    -> Nested Loop (cost=0.00..9907.73 rows=3659 width=34) (actual time=0.014..28.937 rows=3458 loops=1)"
"                          -> Seq Scan on station_category sc (cost=0.00..970.20 rows=3659 width=14) (actual time=0.008..10.947 rows=3458 loops=1)"
"                                Filter: ((taken_start >= '1900-01-01'::date) AND (taken_end <= '1996-12-31'::date) AND (category_id = 1))"
"                          -> Index Scan using station_pkey1 on station s (cost=0.00..2.43 rows=1 width=20) (actual time=0.004..0.004 rows=1 loops=3458)"
"                                Index Cond: (s.id = sc.station_id)"
"                                Filter: (s.applicable AND (s.elevation >= 0) AND (s.elevation <= 3000))"
"              -> Append (cost=0.00..1072.27 rows=947 width=18) (actual time=6.996..63.199 rows=5865 loops=23)"
"                    -> Seq Scan on measurement m (cost=0.00..25.00 rows=6 width=22) (actual time=0.000..0.000 rows=0 loops=23)"
"                          Filter: (m.category_id = 1)"
"                    -> Bitmap Heap Scan on measurement_001 m (cost=20.79..1047.27 rows=941 width=18) (actual time=6.995..62.390 rows=5865 loops=23)"
"                          Recheck Cond: ((m.station_id = sc.station_id) AND (m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND (m.category_id = 1))"
"                          -> Bitmap Index Scan on measurement_001_stc_idx (cost=0.00..20.55 rows=941 width=0) (actual time=5.775..5.775 rows=5865 loops=23)"
"                                Index Cond: ((m.station_id = sc.station_id) AND (m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND (m.category_id = 1))"
"Total runtime: 2269.264 ms"

1900 to 1997: Full Table Scan
"Sort (cost=1370192.26..1370192.76 rows=200 width=12) (actual time=86165.797..86165.809 rows=94 loops=1)"
" Sort Key: (date_part('year'::text, (m.taken)::timestamp without time zone))"
" Sort Method: quicksort Memory: 32kB"
" -> HashAggregate (cost=1370181.12..1370184.62 rows=200 width=12) (actual time=86165.654..86165.736 rows=94 loops=1)"
"        -> Hash Join (cost=4293.60..1366355.81 rows=765061 width=12) (actual time=534.786..85920.007 rows=139721 loops=1)"
"              Hash Cond: (m.station_id = sc.station_id)"
"              Join Filter: ((m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end))"
"              -> Append (cost=0.00..867005.80 rows=43670150 width=18) (actual time=0.009..79202.329 rows=43670079 loops=1)"
"                    -> Seq Scan on measurement m (cost=0.00..25.00 rows=6 width=22) (actual time=0.001..0.001 rows=0 loops=1)"
"                          Filter: (category_id = 1)"
"                    -> Seq Scan on measurement_001 m (cost=0.00..866980.80 rows=43670144 width=18) (actual time=0.008..73312.008 rows=43670079 loops=1)"
"                          Filter: (category_id = 1)"
"              -> Hash (cost=4277.93..4277.93 rows=1253 width=18) (actual time=534.704..534.704 rows=25 loops=1)"
"                    -> Nested Loop (cost=847.87..4277.93 rows=1253 width=18) (actual time=415.837..534.682 rows=25 loops=1)"
"                          Join Filter: ((sec_to_gc(cube_distance((ll_to_earth((c.latitude_decimal)::double precision, (c.longitude_decimal)::double precision))::cube, (ll_to_earth((s.latitude_decimal)::double precision, (s.longitude_decimal)::double precision))::cube)) / 1000::double precision) <= 30::double precision)"
"                          -> Index Scan using city_pkey1 on city c (cost=0.00..2.47 rows=1 width=16) (actual time=0.012..0.014 rows=1 loops=1)"
"                                Index Cond: (id = 5182)"
"                          -> Hash Join (cost=847.87..1352.07 rows=3760 width=34) (actual time=6.427..35.107 rows=3552 loops=1)"
"                                Hash Cond: (s.id = sc.station_id)"
"                                -> Seq Scan on station s (cost=0.00..367.25 rows=7948 width=20) (actual time=0.004..23.529 rows=7949 loops=1)"
"                                      Filter: (applicable AND (elevation >= 0) AND (elevation <= 3000))"
"                                -> Hash (cost=800.87..800.87 rows=3760 width=14) (actual time=6.416..6.416 rows=3552 loops=1)"
"                                      -> Bitmap Heap Scan on station_category sc (cost=430.29..800.87 rows=3760 width=14) (actual time=2.316..5.353 rows=3552 loops=1)"
"                                            Recheck Cond: (category_id = 1)"
"                                            Filter: ((taken_start >= '1900-01-01'::date) AND (taken_end <= '1997-12-31'::date))"
"                                            -> Bitmap Index Scan on station_category_station_category_idx (cost=0.00..429.35 rows=6376 width=0) (actual time=2.268..2.268 rows=6339 loops=1)"
"                                                  Index Cond: (category_id = 1)"
"Total runtime: 86165.936 ms"

Re: Random Page Cost and Planner

From

David Jarvis

Date:

25 May 2010, 03:41:38

Hi,

I changed the date comparison to be based on year alone:

extract(YEAR FROM sc.taken_start) >= 1900 AND
extract(YEAR FROM sc.taken_end) <= 2009 AND

The indexes are now always used; if someone wants to explain why using the numbers works (a constant) but using a date (another constant?) does not work, I'd really appreciate it.

Thanks again, everybody, for your time and help.

Dave

Re: Random Page Cost and Planner

From

"Kevin Grittner"

Date:

25 May 2010, 15:29:00

David Jarvis <thangalin@gmail.com> wrote:

> The value for *random_page_cost* was at 2.0; reducing it to 1.1
> had a massive performance improvement (nearly an order of
> magnitude). While the results now return in 5 seconds (down from
> ~85 seconds)

It sounds as though the active portion of your database is pretty
much cached in RAM.  True?

> problematic lines remain. Bumping the query's end date by a single
> year causes a full table scan

> How do I persuade PostgreSQL to use the indexes, regardless of
> number of years between the two dates?

I don't know about "regardless of the number of years" -- but you
can make such plans look more attractive by cutting both
random_page_cost and seq_page_cost.  Some highly cached loads
perform well with these set to equal values on the order of 0.1 to
0.001.

> (A full table scan against 43 million rows is probably not the
> best plan.)

It would tend to be better than random access to 43 million rows, at
least if you need to go to disk for many of them.

-Kevin

Re: Random Page Cost and Planner

From

David Jarvis

Date:

25 May 2010, 20:27:01

Hi, Kevin.

Thanks for the response.

It sounds as though the active portion of your database is pretty
much cached in RAM. True?

I would not have thought so; there are seven tables, each with 39 to 43 million rows as:

CREATE TABLE climate.measurement (
id bigserial NOT NULL,
taken date NOT NULL,
station_id integer NOT NULL,
amount numeric(8,2) NOT NULL,
flag character varying(1) NOT NULL DEFAULT ' '::character varying,
category_id smallint NOT NULL,
}

The machine has 4GB of RAM, donated to PG as follows:

shared_buffers = 1GB
temp_buffers = 32MB
work_mem = 32MB
maintenance_work_mem = 64MB
effective_cache_size = 256MB

Everything else is at its default value. The kernel:

$ cat /proc/sys/kernel/shmmax
2147483648

Two postgres processes are enjoying the (virtual) space:

2619 postgres 20 0 1126m 524m 520m S 0 13.2 0:09.41 postgres
2668 postgres 20 0 1124m 302m 298m S 0 7.6 0:04.35 postgres

can make such plans look more attractive by cutting both
random_page_cost and seq_page_cost. Some highly cached loads
perform well with these set to equal values on the order of 0.1 to
0.001.

I tried this: no improvement.

It would tend to be better than random access to 43 million rows, at
least if you need to go to disk for many of them.

I thought that the index would take care of this? The index has been set to the unique key of:

station_id, taken, and category_id (the filter for child tables).

Each time I scan for data, I always provide the station identifier and its date range. The date range is obtained from another table (given the same station_id).

I will be trying various other indexes. I've noticed now that sometimes the results are very quick and sometimes very slow. For the query I posted, it would be great to know what would be the best indexes to use. I have a suspicion that that's going to require trial and many errors.

Dave

Re: Random Page Cost and Planner

From

Tom Lane

Date:

25 May 2010, 21:24:42

David Jarvis <thangalin@gmail.com> writes:
>> It sounds as though the active portion of your database is pretty
>> much cached in RAM.  True?

> I would not have thought so; there are seven tables, each with 39 to 43
> million rows as: [ perhaps 64 bytes per row ]
> The machine has 4GB of RAM, donated to PG as follows:

Well, the thing you need to be *really* wary of is setting the cost
parameters to make isolated tests look good.  When you repeat a
particular test case multiple times, all times after the first probably
are fully cached ... but if your DB doesn't actually fit in RAM, that
might not be too representative of what will happen under load.
So if you want to cut the xxx_page_cost settings some more, pay close
attention to what happens to average response time.

            regards, tom lane

Re: Random Page Cost and Planner

From

Rob Wultsch

Date:

25 May 2010, 21:57:27

On Tue, May 25, 2010 at 4:26 PM, David Jarvis <thangalin@gmail.com> wrote:
> shared_buffers = 1GB
> temp_buffers = 32MB
> work_mem = 32MB
> maintenance_work_mem = 64MB
> effective_cache_size = 256MB

Shouldn't effective_cache_size be significantly larger?

--
Rob Wultsch
wultsch@gmail.com

Re: Random Page Cost and Planner

From

David Jarvis

Date:

26 May 2010, 00:50:31

Hi, Tom.

Yes, that is what happened, making the tests rather meaningless, and giving me the false impression that the indexes were being used. They were but only because of cached results. When multiple users making different queries, the performance will return to ~80s per query.

I also tried Kevin's suggestion, which had no noticeable effect:

effective_cache_size = 512MB

That said, when using the following condition, the query is fast (1 second):

    extract(YEAR FROM sc.taken_start) >= 1963 AND
    extract(YEAR FROM sc.taken_end) <= 2009 AND

"                    -> Index Scan using measurement_013_stc_idx on measurement_013 m (cost=0.00..511.00 rows=511 width=15) (actual time=0.018..3.601 rows=3356 loops=104)"
"                          Index Cond: ((m.station_id = sc.station_id) AND (m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND (m.category_id = 7))"

This condition makes it slow (13 seconds on first run, 8 seconds thereafter):

    extract(YEAR FROM sc.taken_start) >= 1900 AND
    extract(YEAR FROM sc.taken_end) <= 2009 AND

"                          Filter: (category_id = 7)"
"                    -> Seq Scan on measurement_013 m (cost=0.00..359704.80 rows=18118464 width=15) (actual time=0.008..4025.692 rows=18118395 loops=1)"

At this point, I'm tempted to write a stored procedure that iterates over each station category for all the years of each station. My guess is that the planner's estimate for the number of rows that will be returned by extract(YEAR FROM sc.taken_start) >= 1900 is incorrect and so it chooses a full table scan for all rows. Even though the lower bound appears to be a constant value of the 1900, the average year a station started collecting data was 44 years ago (1965), and did so for an average of 21.4 years.

The part I am having trouble with is convincing PG to use the index for the station ID and the date range for when the station was active. Each station has a unique ID; the data in the measurement table is ordered by measurement date then by station.

Should I add a clustered index by station then by date?

Any other suggestions are very much appreciated.

Dave

Re: Random Page Cost and Planner

From

David Jarvis

Date:

26 May 2010, 03:13:56

Hi, Rob.

I tried bumping the effective_cache_size. It made no difference.

My latest attempt at forcing PostgreSQL to use the indexes involved two loops: one to loop over the stations, the other to extract the station data from the measurement table. The outer loop executes in 1.5 seconds. The inner loop does a full table scan for each record in the outer loop:

FOR station IN
    SELECT
      sc.station_id,
      sc.taken_start,
      sc.taken_end
    FROM
      climate.city c,
      climate.station s,
      climate.station_category sc
    WHERE
      c.id = city_id AND
      earth_distance(
        ll_to_earth(c.latitude_decimal,c.longitude_decimal),
        ll_to_earth(s.latitude_decimal,s.longitude_decimal)) / 1000 <= radius AND
      s.elevation BETWEEN elevation1 AND elevation2 AND
      s.applicable AND
      sc.station_id = s.id AND
      sc.category_id = category_id AND
      extract(YEAR FROM sc.taken_start) >= year1 AND
      extract(YEAR FROM sc.taken_end) <= year2
    ORDER BY
      sc.station_id
LOOP
    RAISE NOTICE 'B.1. % % %', station.station_id, station.taken_start, station.taken_end;

    FOR measure IN
      SELECT
        extract(YEAR FROM m.taken) AS year,
        avg(m.amount) AS amount
      FROM
        climate.measurement m
      WHERE
        m.station_id = station.station_id AND
        m.taken BETWEEN station.taken_start AND station.taken_end AND
        m.category_id = category_id
      GROUP BY
        extract(YEAR FROM m.taken)
    LOOP
      RAISE NOTICE ' B.2. % %', measure.year, measure.amount;
    END LOOP;
END LOOP;

I thought that the bold lines would have evoked index use. The values used for the inner query:

NOTICE: B.1. 754 1980-08-01 2001-11-30

When I run the query manually, using constants, it executes in ~25 milliseconds:

SELECT
extract(YEAR FROM m.taken) AS year,
avg(m.amount) AS amount
FROM
climate.measurement m
WHERE
m.station_id = 754 AND
m.taken BETWEEN '1980-08-01'::date AND '2001-11-30'::date AND
m.category_id = 7
GROUP BY
extract(YEAR FROM m.taken)

With 106 rows it should execute in ~2.65 seconds, which is better than the 5 seconds I get when everything is cached and a tremendous improvement over the ~85 seconds from cold.

I do not understand why the below query uses a full table scan (executes in ~13 seconds):

SELECT
extract(YEAR FROM m.taken) AS year,
avg(m.amount) AS amount
FROM
climate.measurement m
WHERE
m.station_id = station.station_id AND
m.taken BETWEEN station.taken_start AND station.taken_end AND
m.category_id = category_id
GROUP BY
extract(YEAR FROM m.taken)

Moreover, what can I do to solve the problem?

Thanks again!

Dave

Re: Random Page Cost and Planner

From

Alexey Klyukin

Date:

26 May 2010, 06:00:49

On May 26, 2010, at 6:50 AM, David Jarvis wrote:
>
> That said, when using the following condition, the query is fast (1 second):
>
>     extract(YEAR FROM sc.taken_start) >= 1963 AND
>     extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> "                    ->  Index Scan using measurement_013_stc_idx on measurement_013 m  (cost=0.00..511.00 rows=511
width=15)(actual time=0.018..3.601 rows=3356 loops=104)" 
> "                          Index Cond: ((m.station_id = sc.station_id) AND (m.taken >= sc.taken_start) AND (m.taken
<=sc.taken_end) AND (m.category_id = 7))" 
>
> This condition makes it slow (13 seconds on first run, 8 seconds thereafter):
>
>     extract(YEAR FROM sc.taken_start) >= 1900 AND
>     extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> "                          Filter: (category_id = 7)"
> "                    ->  Seq Scan on measurement_013 m  (cost=0.00..359704.80 rows=18118464 width=15) (actual
time=0.008..4025.692rows=18118395 loops=1)" 
>
> At this point, I'm tempted to write a stored procedure that iterates over each station category for all the years of
eachstation. My guess is that the planner's estimate for the number of rows that will be returned by extract(YEAR FROM
sc.taken_start)>= 1900 is incorrect and so it chooses a full table scan for all rows.  

Nope, it appears that the planner estimate is correct (it estimates 18118464 vs 18118464 real rows). I think what's
happeningthere is  that 18M rows is large enough part of the total table rows that it makes sense to scan it
sequentially(eliminating random access costs).  Try SET enable_seqsan = false and repeat the query - there is a chance
thatthe index scan would be even slower. 

> The part I am having trouble with is convincing PG to use the index for the station ID and the date range for when
thestation was active. Each station has a unique ID; the data in the measurement table is ordered by measurement date
thenby station. 
>
> Should I add a clustered index by station then by date?
>
> Any other suggestions are very much appreciated.

Is it necessary to get the data as far as 1900 all the time ? Maybe there is a possibility to aggregate results
from the past years if they are constant.

Regards,
--
Alexey Klyukin <alexk@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.

Re: Random Page Cost and Planner

From

tv@fuzzy.cz

Date:

26 May 2010, 07:28:30

 Current Folder: Sent       Sign Out
Compose   Addresses   Folders   Options   Autoreply   Search   Help
Calendar       G-Hosting.cz

Message List | Delete | Edit Message as New    Previous | Next     Forward |
Forward as Attachment | Reply | Reply All
Subject:       Re: [PERFORM] Random Page Cost and Planner
From:       tv@fuzzy.cz
Date:       Wed, May 26, 2010 12:01 pm
To:       "David Jarvis" <thangalin@gmail.com>
Priority:       Normal
Options:       View Full Header |  View Printable Version  | Download this as
a file  | View Message details

> Hi, Tom.
>
> Yes, that is what happened, making the tests rather meaningless, and
> giving
> me the false impression that the indexes were being used. They were but
> only
> because of cached results. When multiple users making different queries,
> the
> performance will return to ~80s per query.
>
> I also tried Kevin's suggestion, which had no noticeable effect:
> effective_cache_size = 512MB
>
> That said, when using the following condition, the query is fast (1
> second):
>
>     extract(YEAR FROM sc.taken_start) >= 1963 AND
>     extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> "                    ->  Index Scan using measurement_013_stc_idx on
> measurement_013 m  (cost=0.00..511.00 rows=511 width=15) (actual
> time=0.018..3.601 rows=3356 loops=104)"
> "                          Index Cond: ((m.station_id = sc.station_id) AND
> (m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND
> (m.category_id
> = 7))"
>
> This condition makes it slow (13 seconds on first run, 8 seconds
> thereafter):
>
> *    extract(YEAR FROM sc.taken_start) >= 1900 AND
> *    extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> "                          Filter: (category_id = 7)"
> "                    ->  Seq Scan on measurement_013 m
> (cost=0.00..359704.80 rows=18118464 width=15) (actual time=0.008..4025.692
> rows=18118395 loops=1)"
>
> At this point, I'm tempted to write a stored procedure that iterates over
> each station category for all the years of each station. My guess is that
> the planner's estimate for the number of rows that will be returned by
> *extract(YEAR
> FROM sc.taken_start) >= 1900* is incorrect and so it chooses a full table
> scan for all rows. Even though the lower bound appears to be a constant
> value of the 1900, the average year a station started collecting data was
> 44
> years ago (1965), and did so for an average of 21.4 years.
>
> The part I am having trouble with is convincing PG to use the index for
> the
> station ID and the date range for when the station was active. Each
> station
> has a unique ID; the data in the measurement table is ordered by
> measurement
> date then by station.

Well, don't forget indexes may not be the best way to evaluate the query -
if the selectivity is low (the query returns large portion of the table)
the sequetial scan is actually faster. The problem is using index means
you have to read the index blocks too, and then the table blocks, and this
is actually random access. So your belief that thanks to using indexes the
query will run faster could be false.

And this is what happens in the queries above - the first query covers
years 1963-2009, while the second one covers 1900-2009. Given the fact
this table contains ~40m rows, the first query returns about 0.01% (3k
rows) while the second one returns almost 50% of the data (18m rows). So I
doubt this might be improved using an index ...

But you can try that by setting enable_seqscan=off or proper setting of
the random_page_cost / seq_page_cost variables (so that the plan with
indexes is cheaper than the sequential scan). You can do that in the
session (e.g. use SET enable_seqscan=off) so that you won't harm other
sessions.

> Should I add a clustered index by station then by date?
>
> Any other suggestions are very much appreciated.

Well, the only thing that crossed my mind is partitioning with properly
defined constraints and constrain_exclusion=on. I'd recommend partitioning
by time (each year a separate partition) but you'll have to investigate
that on your own (depends on your use-cases).

BTW the cache_effective_size mentioned in the previous posts is just an
'information parameter' - it does not increase the amount of memory
allocated by PostgreSQL. It merely informs PostgreSQL of expected disk
cache size maintained by the OS (Linux), so that PostgreSQL may estimate
the change that the requested data are actually cached (and won't be read
from the disk).

regards
Tomas

Re: Random Page Cost and Planner

From

"Kevin Grittner"

Date:

26 May 2010, 12:55:18

David Jarvis <thangalin@gmail.com> wrote:

>> It sounds as though the active portion of your database is pretty
>> much cached in RAM.  True?

> I would not have thought so; there are seven tables, each with 39
> to 43 million rows

> The machine has 4GB of RAM

In that case, modifying seq_page_cost or setting random_page_cost
below something in the range of 1.5 to 2 is probably not going to be
a good choice for the mix as a whole.

> effective_cache_size = 256MB

This should probably be set to something on the order of 3GB.  This
will help the optimizer make more intelligent choices about when use
of the index will be a win.

>> It would tend to be better than random access to 43 million rows,
>> at least if you need to go to disk for many of them.
>
> I thought that the index would take care of this?

When the index can limit the number of rows to a fraction of the 43
million rows, using it is a win.  The trick is to accurately model
the relative costs of different aspects of running the query, so
that when the various plans are compared, the one which looks the
cheapest actually *is*.  Attempting to force any particular plan
through other means is risky.

> I will be trying various other indexes. I've noticed now that
> sometimes the results are very quick and sometimes very slow. For
> the query I posted, it would be great to know what would be the
> best indexes to use. I have a suspicion that that's going to
> require trial and many errors.

Yeah, there's no substitute for testing your actual software against
the actual data.  Be careful, though -- as previously mentioned
caching can easily distort results, particularly when you run the
same query, all by itself (with no competing queries) multiple
times.  You'll get your best information if you can simulate a
more-or-less realistic load, and try that with various settings and
indexes.  The cache turnover and resource contention involved in
production can influence performance, and are hard to estimate any
other way.

-Kevin

Re: Random Page Cost and Planner

From

David Jarvis

Date:

26 May 2010, 13:30:36

Hi, Alexey.

Is it necessary to get the data as far as 1900 all the time ? Maybe there is a possibility to aggregate results from the past years if they are constant.

This I have done. I created another table (station_category) that associates stations with when they started to take measurements and when they stopped (based on the data in the measurement table). For example:

station_id; category_id; taken_start; taken_end
1;4;"1984-07-01";"1996-11-30"
1;5;"1984-07-01";"1996-11-30"
1;6;"1984-07-01";"1996-11-10"
1;7;"1984-07-01";"1996-10-31"

This means that station 1 has data for categories 4 through 7. The measurement table returns 3865 rows for station 1 and category 7 (this uses an index and took 7 seconds cold):

station_id; taken; amount
1;"1984-07-01";0.00
1;"1984-07-02";0.00
1;"1984-07-03";0.00
1;"1984-07-04";0.00

The station_category table is basically another index.

Would explicitly sorting the measurement table (273M rows) by station then by date help?

Dave

Re: Random Page Cost and Planner

From

tv@fuzzy.cz

Date:

26 May 2010, 15:16:41

> Hi,
>
> And this is what happens in the queries above - the first query covers
>> years 1963-2009, while the second one covers 1900-2009. Given the fact
>> this table contains ~40m rows, the first query returns about 0.01% (3k
>> rows) while the second one returns almost 50% of the data (18m rows). So
>> I
>> doubt this might be improved using an index ...
>>
>
> I don't think that's what I'm doing.
>
> There are two tables involved: station_category (sc) and measurement (m).
>
> The first part of the query:
>
>      extract(YEAR FROM sc.taken_start) >= 1900 AND
>      extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> That is producing a limit on the station_category table. There are, as far
> as I can tell, no stations that have been taking weather readings for 110
> years. Most of them have a lifespan of 24 years. The above condition just
> makes sure that I don't get data before 1900 or after 2009.
>


OK, I admit I'm a little bit condfused by the query, especially by these
rows:

sc.taken_start >= '1900-01-01'::date AND
sc.taken_end <= '1996-12-31'::date AND
m.taken BETWEEN sc.taken_start AND sc.taken_end AND

Which seems to me a little bit "convoluted". Well, I think I understand
what that means - give me all stations for a given city, collecting the
category of data at a certain time. But I'm afraid this makes the planning
much more difficult, as the select from measurements depend on the data
returned by other parts of the query (rows from category).

See this http://explain.depesz.com/s/H1 and this
http://explain.depesz.com/s/GGx

I guess the planner is confused in the second case - believes it has to
read a lot more data from the measurement table, and so chooses the
sequential scan. The question is if this is the right decision (I believe
it is not).

How many rows does the query return without the group by clause? About
140000 in both cases, right?

>> by time (each year a separate partition) but you'll have to investigate
>> that on your own (depends on your use-cases).
>>
>
> I cannot partition by time. First, there are 7 categories, which would
> mean
> 770 partitions if I did it by year -- 345000 rows per partition. This will
> grow in the future. I have heard there are troubles with having lots of
> child tables (too many files for the operating system). Second, the user
> has
> the ability to pick arbitrary day ranges for arbitrary year spans.
>
> There's a "year wrapping" issue that I won't explain because I never get
> it
> right the first time. ;-)

OK, I haven't noticed the table is already partitioned by category_id and
I didn't mean to partition by (taken, category_id) - that would produce a
lot of partitions. Yes, that might cause problems related to number of
files, but that's rather a filesystem related issue.

I'd expect rather issues related to RULEs or triggers (not sure which of
them you use to redirect the data into partitions). But when partitioning
by time (and not by category_id) the number of partitions will be much
lower and you don't have to keep all of the rules active - all you need is
a rule for the current year (and maybe the next one).

I'm not sure what you mean by 'year wrapping issue' but I think it might
work quite well - right not the problem is PostgreSQL decides to scan the
whole partition (all data for a given category_id).

regards
Tomas

Re: Random Page Cost and Planner

From

David Jarvis

Date:

26 May 2010, 15:27:04

Hi, Kevin.

below something in the range of 1.5 to 2 is probably not going to be
a good choice for the mix as a whole.

Good to know; thanks.

This should probably be set to something on the order of 3GB. This
will help the optimizer make more intelligent choices about when use
of the index will be a win.

I'll try this.

times. You'll get your best information if you can simulate a
more-or-less realistic load, and try that with various settings and

I have no idea what a realistic load will be. The system is still in development and not open to the general public. I also don't know how much publicity the system will receive when finished. Could be a few hundred hits, could be over ten thousand.

I want the system to be ready for the latter case, which means it needs to return data for many different query parameters (date span, elevation, year, radius, etc.) in under two seconds.

indexes. The cache turnover and resource contention involved in
production can influence performance, and are hard to estimate any
other way.

Another person suggested to take a look at the data.

I ran a query to see if it makes sense to split the data by year. The trouble is that there are 110 years and 7 categories. The data is already filtered into child tables by category (that is logical because reporting on two different categories is nonsensical -- it is meaningless to report on snow depth and temperature: we already know it needs to be cold for snow).

count;decade start; decade end; min date; max date
3088;1990;2000;"1990-01-01";"2009-12-31"
2925;1980;2000;"1980-01-01";"2009-12-31"
2752;2000;2000;"2000-01-01";"2009-12-31"
2487;1970;1970;"1970-01-01";"1979-12-31"
2391;1980;1990;"1980-02-01";"1999-12-31"
2221;1980;1980;"1980-01-01";"1989-12-31"
1934;1960;2000;"1960-01-01";"2009-12-31"
1822;1960;1960;"1960-01-01";"1969-12-31"
1659;1970;1980;"1970-01-01";"1989-12-31"
1587;1960;1970;"1960-01-01";"1979-12-31"
1524;1970;2000;"1970-01-01";"2009-12-31"

The majority of data collected by weather stations is between 1960 and 2009, which makes sense because transistor technology would have made for (relatively) inexpensive automated monitoring stations. Or maybe there were more people and more taxes collected thus a bigger budget for weather study. Either way. ;-)

The point is the top three decades (1990, 1980, 2000) have the most data, giving me a few options:

Split the seven tables twice more: before 1960 and after 1960.
Split the seven tables by decade.

The first case gives 14 tables. The second case gives 102 tables (at 2.5M rows per table) as there are about 17 decades in total. This seems like a manageable number of tables as the data might eventually span 22 decades, which would be 132 tables.

Even though the users will be selecting 1900 to 2009, most of the stations themselves will be within the 1960 - 2009 range, with the majority of those active between 1980 and 2009.

Would splitting by decade improve the speed?

Thank you very much.

Dave

Re: Random Page Cost and Planner

From

David Jarvis

Date:

26 May 2010, 15:55:46

Hi,

sc.taken_end <= '1996-12-31'::date AND
m.taken BETWEEN sc.taken_start AND sc.taken_end AND

category of data at a certain time. But I'm afraid this makes the planning
much more difficult, as the select from measurements depend on the data
returned by other parts of the query (rows from category).

Right. Users can select 1900 - 2009. Station data hardly ever spans that range.

The station_category is used to create a unique key into the measurement data for every station: station_id, category_id, and taken_start. The measurement data should be contiguous until taken_end.

I thought that that combination would be a pointer to the exact spot in the measurement table where the data starts, which should be ridiculously fast to find.

See this http://explain.depesz.com/s/H1 and this
http://explain.depesz.com/s/GGx

I was getting some red lines when I looked at a different plan. It's a great site.

How many rows does the query return without the group by clause? About
140000 in both cases, right?

SELECT
*
FROM
climate.measurement m
WHERE
m.station_id = 5148 AND
m.taken BETWEEN '1900-08-01'::date AND '2009-12-31'::date AND
m.category_id = 1

5397 rows (10 seconds cold; 0.5 seconds hot); estimated too high by 2275 rows?

http://explain.depesz.com/s/uq

OK, I haven't noticed the table is already partitioned by category_id and
I didn't mean to partition by (taken, category_id) - that would produce a
lot of partitions. Yes, that might cause problems related to number of
files, but that's rather a filesystem related issue.

Constrained as:

CONSTRAINT measurement_013_category_id_ck CHECK (category_id = 7)

I'd expect rather issues related to RULEs or triggers (not sure which of
them you use to redirect the data into partitions). But when partitioning

I created seven child tables of measurement. Each of these has a constraint by category_id. This makes it extremely fast to select the correct partition.

I'm not sure what you mean by 'year wrapping issue' but I think it might
work quite well - right not the problem is PostgreSQL decides to scan the
whole partition (all data for a given category_id).

I'll give it another try. :-)

Use Case #1
User selects: Mar 22 to Dec 22
User selects: 1900 to 2009

Result: Query should average 9 months of climate data per year between Mar 22 and Dec 22 of Year.

Use Case #2

User selects: Dec 22 to Mar 22
User selects: 1900 to 2009

Result: Query should average 3 months of climate data per year between Dec 22 of Year and Mar 22 of Year+1.

So if a user selects 1950 to 1960:

first case should average between 1950 and 1960; and
second case should average between 1950 and 1961.

Dave

Re: Random Page Cost and Planner

From

David Jarvis

Date:

26 May 2010, 17:21:29

I was told to try OVERLAPS instead of checking years. The query is now:

SELECT
    extract(YEAR FROM m.taken) AS year,
    avg(m.amount) as amount
FROM
    climate.city c,
    climate.station s,
    climate.station_category sc,
    climate.measurement m
WHERE
    c.id = 5148 AND
    earth_distance(
      ll_to_earth(c.latitude_decimal,c.longitude_decimal),
      ll_to_earth(s.latitude_decimal,s.longitude_decimal)) / 1000 <= 30 AND
    s.elevation BETWEEN 0 AND 3000 AND
    s.applicable = TRUE AND
    sc.station_id = s.id AND
    sc.category_id = 7 AND
    (sc.taken_start, sc.taken_end) OVERLAPS ('1900-01-01'::date, '2009-12-31'::date) AND
    m.station_id = s.id AND
    m.taken BETWEEN sc.taken_start AND sc.taken_end AND
    m.category_id = sc.category_id
GROUP BY
    extract(YEAR FROM m.taken)
ORDER BY
    extract(YEAR FROM m.taken)

25 seconds from cold, no full table scan:

http://explain.depesz.com/s/VV5

Much better than 85 seconds, but still an order of magnitude too slow.

I was thinking of changing the station_category table to use the measurement table's primary key, instead of keying off date, as converting the dates for comparison strikes me as a bit of overhead. Also, I can get remove the "/ 1000" by changing the Earth's radius to kilometres (from metres), but a constant division shouldn't be significant.

I really appreciate all your patience and help over the last sixteen days trying to optimize this database and these queries.

Dave

Re: Random Page Cost and Planner

From

David Jarvis

Date:

27 May 2010, 04:43:24

Hi, Bryan.

I was just about to reply to the thread, thanks for asking. Clustering was key. After rebooting the machine (just to make sure absolutely nothing was cached), I immediately ran a report on Toronto: 5.25 seconds!

Here's what I did:

Created a new set of tables that matched the old set, with statistics of 1000 on the station and taken (date) columns.
Inserted the data from the old hierarchy into the new set, ordered by station id then by date (same seven child tables as before: one per category).
- I wanted to ensure a strong correlation between primary key and station id.
Added three indexes per table: (a) station id; (b) date taken; and (c) station-taken-category.
Set the station-taken-category index as CLUSTER.
Vacuumed the new tables.
Dropped the old tables.
Set the following configuration values:
- shared_buffers = 1GB
- temp_buffers = 32MB
- work_mem = 32MB
- maintenance_work_mem = 64MB
- seq_page_cost = 1.0
- random_page_cost = 2.0
- cpu_index_tuple_cost = 0.001
- effective_cache_size = 512MB

I ran a few more reports (no reboots, but reading vastly different data sets):

Vancouver: 4.2s
Yellowknife: 1.7s
Montreal: 6.5s
Trois-Riviers: 2.8s

No full table scans. I imagine some indexes are not strictly necessary and will test to see which can be removed (my guess: the station and taken indexes). The problem was that the station ids were scattered and so PostgreSQL presumed a full table scan would be faster.

Physically ordering the data by station ids triggers index use every time.

Next week's hardware upgrade should halve those times -- unless anyone has further suggestions to squeeze more performance out of PG. ;-)

Dave

Re: Random Page Cost and Planner

From

Cédric Villemain

Date:

27 May 2010, 05:03:46

2010/5/27 David Jarvis <thangalin@gmail.com>:
> Hi, Bryan.
>
> I was just about to reply to the thread, thanks for asking. Clustering was
> key. After rebooting the machine (just to make sure absolutely nothing was
> cached), I immediately ran a report on Toronto: 5.25 seconds!
>
> Here's what I did:
>
> Created a new set of tables that matched the old set, with statistics of
> 1000 on the station and taken (date) columns.
> Inserted the data from the old hierarchy into the new set, ordered by
> station id then by date (same seven child tables as before: one per
> category).
>
> I wanted to ensure a strong correlation between primary key and station id.
>
> Added three indexes per table: (a) station id; (b) date taken; and (c)
> station-taken-category.
> Set the station-taken-category index as CLUSTER.
> Vacuumed the new tables.
> Dropped the old tables.
> Set the following configuration values:
>
> shared_buffers = 1GB
> temp_buffers = 32MB
> work_mem = 32MB
> maintenance_work_mem = 64MB
> seq_page_cost = 1.0
> random_page_cost = 2.0
> cpu_index_tuple_cost = 0.001
> effective_cache_size = 512MB
>
> I ran a few more reports (no reboots, but reading vastly different data
> sets):
>
> Vancouver: 4.2s
> Yellowknife: 1.7s
> Montreal: 6.5s
> Trois-Riviers: 2.8s
>
> No full table scans. I imagine some indexes are not strictly necessary and
> will test to see which can be removed (my guess: the station and taken
> indexes). The problem was that the station ids were scattered and so
> PostgreSQL presumed a full table scan would be faster.
>
> Physically ordering the data by station ids triggers index use every time.
>
> Next week's hardware upgrade should halve those times -- unless anyone has
> further suggestions to squeeze more performance out of PG. ;-)

I wonder what the plan will be if you replace sc.taken_* in :
m.taken BETWEEN sc.taken_start AND sc.taken_end
by values. It might help the planner...

Also, I'll consider explicit ordered join but I admit I haven't read
the whole thread (in particular the table size).
Ho, and I set statistics to a highter value for column category_id,
table station_category  (seeing the same resquest and explain analyze
without date in the query will help)


>
> Dave
>
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: Random Page Cost and Planner

From

David Jarvis

Date:

27 May 2010, 12:56:03

Salut, Cédric.

I wonder what the plan will be if you replace sc.taken_* in :
m.taken BETWEEN sc.taken_start AND sc.taken_end
by values. It might help the planner...

That is a fairly important restriction. I will try making it (year1||'-01-01')::date, but I have no constant value for it -- it is a user-supplied parameter. And then there's the year wrapping problem, too, where the ending year will differ from the starting year in certain cases. (Like querying rows between Dec 22, 1900 to Mar 22 1901 rather than Mar 22 1900 to Dec 22 1900. The first query is the winter season and the second query is all seasons except winter.)

Also, I'll consider explicit ordered join but I admit I haven't read
the whole thread (in particular the table size).

C'est une grosse table. Pres que 40 million lines; il y a sept tableau comme ca.

I tried an explicit join in the past: it did not help much. But that was before everything was running this fast, so now that the system performs differently, maybe it will help?

Dave

Re: Random Page Cost and Planner

From

Cédric Villemain

Date:

27 May 2010, 15:29:06

2010/5/27 David Jarvis <thangalin@gmail.com>:
> Salut, Cédric.
>
>> I wonder what the plan will be if you replace sc.taken_* in :
>> m.taken BETWEEN sc.taken_start AND sc.taken_end
>> by values. It might help the planner...
>
> That is a fairly important restriction. I will try making it
> (year1||'-01-01')::date, but I have no constant value for it -- it is a
> user-supplied parameter. And then there's the year wrapping problem, too,
> where the ending year will differ from the starting year in certain cases.
> (Like querying rows between Dec 22, 1900 to Mar 22 1901 rather than Mar 22
> 1900 to Dec 22 1900. The first query is the winter season and the second
> query is all seasons except winter.)

Ah, I though that you had a start and an end provided (so able to put
them in the query)

>
>>
>> Also, I'll consider explicit ordered join but I admit I haven't read
>> the whole thread (in particular the table size).
>
> C'est une grosse table. Pres que 40 million lines; il y a sept tableau comme
> ca.
>
> I tried an explicit join in the past: it did not help much. But that was
> before everything was running this fast, so now that the system performs
> differently, maybe it will help?

yes. the documentation is fine for this topic :
http://www.postgresql.org/docs/8.4/interactive/explicit-joins.html
Consider the parameter to explicit join order (you can set it per sql session).

You know your data and know what are the tables with less results to
join first.  ;)

>
> Dave
>
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: Random Page Cost and Planner

From

Bryan Hinton

Date:

27 May 2010, 19:04:34

Agree with Tom on his point about avoidance of cost param adjustments to fit specific test cases.

A few suggestions...as I assume you own this database...

- check out pg_statio_user_tables - optimize your cache hit ratio on blocks read...different time durations... pg_stat_bgwriter (read from a script or something and snapshot)

- pg_buffercache in contrib/

- /proc/meminfo on linux

- find out exactly what is going on with your kernel buffer cache (size, how it is buffering) and if your controller or drive is using a read ahead cache.

- might want to play around with partial indexes vs. and/or range partitioning with exclusion constraints, etc.

- define I/O characteristics of the dataset - taking into account index clustering and index order on in-memory pages (i.e. re-cluster?), why need for multiple index if clustering indexes on heap?

- solidify the referential integrity constraints between those tables, on paper....define the use cases before modifying the database tables...i assume this is a dev database

- linux fs mount options to explore - i.e. noatime, writeback, etc.

-maybe look at prepared statements if you are running alot of similar queries from a single session? assuming web front end for your db - with say frequently queried region/category/dates for large read-only dataset with multiple join conditions?

There are some good presentations on pgcon.org from PGCon 2010 that was held last week...

http://www.pgcon.org/2010/schedule/events/218.en.html

If you take everything into account and model it correctly (not too loose, not too tight), your solution will be reusable and will save time and hardware expenses.

Regards -

Bryan

On Thu, May 27, 2010 at 2:43 AM, David Jarvis <thangalin@gmail.com> wrote:

Hi, Bryan.

I was just about to reply to the thread, thanks for asking. Clustering was key. After rebooting the machine (just to make sure absolutely nothing was cached), I immediately ran a report on Toronto: 5.25 seconds!

Here's what I did:
Created a new set of tables that matched the old set, with statistics of 1000 on the station and taken (date) columns.
Inserted the data from the old hierarchy into the new set, ordered by station id then by date (same seven child tables as before: one per category).
I wanted to ensure a strong correlation between primary key and station id.
Added three indexes per table: (a) station id; (b) date taken; and (c) station-taken-category.
Set the station-taken-category index as CLUSTER.
Vacuumed the new tables.
Dropped the old tables.
Set the following configuration values:
shared_buffers = 1GB
temp_buffers = 32MB
work_mem = 32MB
maintenance_work_mem = 64MB

seq_page_cost = 1.0
random_page_cost = 2.0
cpu_index_tuple_cost = 0.001
effective_cache_size = 512MB

I ran a few more reports (no reboots, but reading vastly different data sets):

Vancouver: 4.2s
Yellowknife: 1.7s
Montreal: 6.5s
Trois-Riviers: 2.8s

Re: Random Page Cost and Planner

From

David Jarvis

Date:

28 May 2010, 00:29:27

Hi, Bryan.

Thanks for the notes. I thought about using a prepared statement, but I cannot find any examples of using a PREPARE statement from within a function, and don't really feel like tinkering around to figure it out.

Performance is at the point where the Java/PHP bridge and JasperReports are bottlenecks. The run_time variable seldom goes beyond 2.6s now. The reports take about 5 - 6 seconds to appear. At this point I'm into diminishing returns.

I can perform a 60-minute hardware upgrade or spend 12 hours profiling to get less than the same net effect (and there is no guarantee I can improve the performance in fewer than 12 hours -- it took me 17 days and countless e-mails to this mailing group just to get this far -- thank you again for all the help, by the way). (If I was a PostgreSQL guru like most people on this list, it might take me 2 hours of profiling to optimize away the remaining bottlenecks, but even then the gain would only be a second or two in the database arena; the other system components will also gain by a hardware upgrade.)

Dave

Weird XFS WAL problem

From

Craig James

Date:

02 June 2010, 20:36:08

I'm testing/tuning a new midsize server and ran into an inexplicable problem.  With an RAID10 drive, when I move the
WALto a separate RAID1 drive, TPS drops from over 1200 to less than 90!   I've checked everything and can't find a
reason.

Here are the details.

8 cores (2x4 Intel Nehalem 2 GHz)
12 GB memory
12 x 7200 SATA 500 GB disks
3WARE 9650SE-12ML RAID controller with bbu
   2 disks: RAID1  500GB ext4  blocksize=4096
   8 disks: RAID10 2TB, stripe size 64K, blocksize=4096 (ext4 or xfs - see below)
   2 disks: hot swap
Ubuntu 10.04 LTS (Lucid)

With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench results (this one is for xfs):

Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
argon        24064M 70491  99 288158  25 129918  16 65296  97 428210  23 558.9   1
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  16 23283  81 +++++ +++ 13775  56 20143  74 +++++ +++ 15152  54
argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+++++,+++,13775,56,20143\
,74,+++++,+++,15152,54

pgbench -i -s 100 -U test
pgbench -c 10 -t 10000 -U test
     scaling factor: 100
     query mode: simple
     number of clients: 10
     number of transactions per client: 10000
     number of transactions actually processed: 100000/100000
     tps = 1046.104635 (including connections establishing)
     tps = 1046.337276 (excluding connections establishing)

Now the mystery: I moved the pg_xlog directory to a RAID1 array (same 3WARE controller, two more SATA 7200 disks).  Run
thesame tests and ... 

     tps = 82.325446 (including connections establishing)
     tps = 82.326874 (excluding connections establishing)

I thought I'd made a mistake, like maybe I moved the whole database to the RAID1 array, but I checked and double
checked. I even watched the lights blink - the WAL was definitely on the RAID1 and the rest of Postgres on the RAID10. 

So I moved the WAL back to the RAID10 array, and performance jumped right back up to the >1200 TPS range.

Next I check the RAID1 itself:

   dd if=/dev/zero of=./bigfile bs=8192 count=2000000

which yielded 98.8 MB/sec - not bad.  bonnie++ on the RAID1 pair showed good performance too:

Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
argon        24064M 68601  99 110057  18 46534   6 59883  90 123053   7 471.3   1
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,\
+++,+++++,+++,+++++,+++

So ... anyone have any idea at all how TPS drops to below 90 when I move the WAL to a separate RAID1 disk?  Does this
makeany sense at all?  It's repeatable. It happens for both ext4 and xfs. It's weird. 

You can even watch the disk lights and see it: the RAID10 disks are on almost constantly when the WAL is on the RAID10,
butwhen you move the WAL over to the RAID1, its lights are dim and flicker a lot, like it's barely getting any data,
andthe RAID10 disk's lights barely go on at all. 

Thanks,
Craig

Re: Weird XFS WAL problem

From

Mark Kirkwood

Date:

02 June 2010, 20:41:05

On 03/06/10 11:30, Craig James wrote:
> I'm testing/tuning a new midsize server and ran into an inexplicable
> problem.  With an RAID10 drive, when I move the WAL to a separate
> RAID1 drive, TPS drops from over 1200 to less than 90!   I've checked
> everything and can't find a reason.
>
>

Are the 2 new RAID1 disks the same make and model as the 12 RAID10 ones?

Also, are barriers *on* on the RAID1 mount and off on the RAID10 one?

Cheers

Mark

Re: Weird XFS WAL problem

From

Merlin Moncure

Date:

03 June 2010, 10:01:12

On Wed, Jun 2, 2010 at 7:30 PM, Craig James <craig_james@emolecules.com> wrote:
> I'm testing/tuning a new midsize server and ran into an inexplicable
> problem.  With an RAID10 drive, when I move the WAL to a separate RAID1
> drive, TPS drops from over 1200 to less than 90!   I've checked everything
> and can't find a reason.
>
> Here are the details.
>
> 8 cores (2x4 Intel Nehalem 2 GHz)
> 12 GB memory
> 12 x 7200 SATA 500 GB disks
> 3WARE 9650SE-12ML RAID controller with bbu
>  2 disks: RAID1  500GB ext4  blocksize=4096
>  8 disks: RAID10 2TB, stripe size 64K, blocksize=4096 (ext4 or xfs - see
> below)
>  2 disks: hot swap
> Ubuntu 10.04 LTS (Lucid)
>
> With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench results
> (this one is for xfs):
>
> Version 1.03e       ------Sequential Output------ --Sequential Input-
> --Random-
>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec
> %CP
> argon        24064M 70491  99 288158  25 129918  16 65296  97 428210  23
> 558.9   1
>                    ------Sequential Create------ --------Random
> Create--------
>                    -Create-- --Read--- -Delete-- -Create-- --Read---
> -Delete--
>              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec
> %CP
>                 16 23283  81 +++++ +++ 13775  56 20143  74 +++++ +++ 15152
>  54
> argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+++++,+++,13775,56,20143\
> ,74,+++++,+++,15152,54
>
> pgbench -i -s 100 -U test
> pgbench -c 10 -t 10000 -U test
>    scaling factor: 100
>    query mode: simple
>    number of clients: 10
>    number of transactions per client: 10000
>    number of transactions actually processed: 100000/100000
>    tps = 1046.104635 (including connections establishing)
>    tps = 1046.337276 (excluding connections establishing)
>
> Now the mystery: I moved the pg_xlog directory to a RAID1 array (same 3WARE
> controller, two more SATA 7200 disks).  Run the same tests and ...
>
>    tps = 82.325446 (including connections establishing)
>    tps = 82.326874 (excluding connections establishing)
>
> I thought I'd made a mistake, like maybe I moved the whole database to the
> RAID1 array, but I checked and double checked.  I even watched the lights
> blink - the WAL was definitely on the RAID1 and the rest of Postgres on the
> RAID10.
>
> So I moved the WAL back to the RAID10 array, and performance jumped right
> back up to the >1200 TPS range.
>
> Next I check the RAID1 itself:
>
>  dd if=/dev/zero of=./bigfile bs=8192 count=2000000
>
> which yielded 98.8 MB/sec - not bad.  bonnie++ on the RAID1 pair showed good
> performance too:
>
> Version 1.03e       ------Sequential Output------ --Sequential Input-
> --Random-
>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec
> %CP
> argon        24064M 68601  99 110057  18 46534   6 59883  90 123053   7
> 471.3   1
>                    ------Sequential Create------ --------Random
> Create--------
>                    -Create-- --Read--- -Delete-- -Create-- --Read---
> -Delete--
>              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec
> %CP
>                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++
> +++
> argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,\
> +++,+++++,+++,+++++,+++
>
> So ... anyone have any idea at all how TPS drops to below 90 when I move the
> WAL to a separate RAID1 disk?  Does this make any sense at all?  It's
> repeatable. It happens for both ext4 and xfs. It's weird.
>
> You can even watch the disk lights and see it: the RAID10 disks are on
> almost constantly when the WAL is on the RAID10, but when you move the WAL
> over to the RAID1, its lights are dim and flicker a lot, like it's barely
> getting any data, and the RAID10 disk's lights barely go on at all.

*) Is your raid 1 configured writeback cache on the controller?
*) have you tried changing wal_sync_method to fdatasync?

merlin

Re: Weird XFS WAL problem

From

Greg Smith

Date:

03 June 2010, 13:52:57

Craig James wrote:
> I'm testing/tuning a new midsize server and ran into an inexplicable
> problem.  With an RAID10 drive, when I move the WAL to a separate
> RAID1 drive, TPS drops from over 1200 to less than 90!

Normally <100 TPS means that the write cache on the WAL drive volume is
disabled (or set to write-through instead of write-back).  When things
in this area get fishy, I will usually download sysbench and have it
specifically test how many fsync calls can happen per second.
http://projects.2ndquadrant.com/talks , "Database Hardware
Benchmarking", page 28 has an example of the right incantation for that.

Also, make sure you run 3ware's utilities and confirm all the disks have
finished their initialization and verification stages.  If you just
adjusted disk layout that and immediate launched into benchmarks, those
are useless until the background cleanup is done.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: Weird XFS WAL problem

From

Craig James

Date:

03 June 2010, 14:06:24

On 6/2/10 4:40 PM, Mark Kirkwood wrote:
> On 03/06/10 11:30, Craig James wrote:
>> I'm testing/tuning a new midsize server and ran into an inexplicable
>> problem. With an RAID10 drive, when I move the WAL to a separate RAID1
>> drive, TPS drops from over 1200 to less than 90! I've checked
>> everything and can't find a reason.
>
> Are the 2 new RAID1 disks the same make and model as the 12 RAID10 ones?

Yes.

> Also, are barriers *on* on the RAID1 mount and off on the RAID10 one?

It was the barriers.  "barrier=1" isn't just a bad idea on ext4, it's a disaster.

pgbench -i -s 100 -U test
pgbench -c 10 -t 10000 -U test

Change WAL to barrier=0

     tps = 1463.264981 (including connections establishing)
     tps = 1463.725687 (excluding connections establishing)

Change WAL to noatime, nodiratime, barrier=0

     tps = 1479.331476 (including connections establishing)
     tps = 1479.810545 (excluding connections establishing)

Change WAL to barrier=1

     tps = 82.325446 (including connections establishing)
     tps = 82.326874 (excluding connections establishing)

This is really hard to believe, because the bonnie++ numbers and dd(1) numbers look good (see my original post).  But
it'stotally repeatable.  It must be some really unfortunate "just missed the next sector going by the write head"
problem.

So with ext4, bonnie++ and dd aren't the whole story.

BTW, I also learned that if you edit /etc/fstab and use "mount -oremount" it WON'T change "barrier=0/1" unless it is
explicitin the fstab file.  That is, if you put "barrier=0" into /etc/fstab and use the remount, it will change it to
nobarriers.  But if you then remove it from /etc/fstab, it won't change it back to the default.  You have to actually
put"barrier=1" if you want to get it back to the default.  This seems like a bug to me, and it made it really hard to
trackthis down. "mount -oremount" is not the same as umount/mount! 

Craig

Re: Weird XFS WAL problem

From

Matthew Wakeling

Date:

03 June 2010, 14:14:17

On Thu, 3 Jun 2010, Craig James wrote:
>> Also, are barriers *on* on the RAID1 mount and off on the RAID10 one?
>
> It was the barriers.  "barrier=1" isn't just a bad idea on ext4, it's a
> disaster.

This worries me a little. Does your array have a battery-backed cache? If
so, then it should be fast regardless of barriers (although barriers may
make a small difference). If it does not, then it is likely that the fast
speed you are seeing with barriers off is unsafe.

There should be no "just missed the sector going past for write" problem
ever with a battery-backed cache.

Matthew

--
 There once was a limerick .sig
 that really was not very big
 It was going quite fine
 Till it reached the fourth line

Re: Weird XFS WAL problem

From

"Kevin Grittner"

Date:

03 June 2010, 14:31:18

Matthew Wakeling <matthew@flymine.org> wrote:
> On Thu, 3 Jun 2010, Craig James wrote:
>>> Also, are barriers *on* on the RAID1 mount and off on the RAID10
one?
>>
>> It was the barriers.  "barrier=1" isn't just a bad idea on ext4,
>> it's a disaster.
>
> This worries me a little. Does your array have a battery-backed
> cache? If so, then it should be fast regardless of barriers
> (although barriers may make a small difference). If it does not,
> then it is likely that the fast speed you are seeing with barriers
> off is unsafe.

I've seen this, too (with xfs).  Our RAID controller, in spite of
having BBU cache configured for writeback, waits for actual
persistence on disk for write barriers (unlike for fsync).  This
does strike me as surprising to the point of bordering on qualifying
as a bug.  It means that you can't take advantage of the BBU cache
and get the benefit of write barriers in OS cache behavior.  :-(

-Kevin

Re: Weird XFS WAL problem

From

Greg Smith

Date:

03 June 2010, 15:18:48

Kevin Grittner wrote:
> I've seen this, too (with xfs).  Our RAID controller, in spite of
> having BBU cache configured for writeback, waits for actual
> persistence on disk for write barriers (unlike for fsync).  This
> does strike me as surprising to the point of bordering on qualifying
> as a bug.
Completely intentional, and documented at
http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F

The issue is that XFS will actually send the full "flush your cache"
call to the controller, rather than just the usual fsync call, and that
eliminates the benefit of having a write cache there in the first
place.  Good controllers respect that and flush their whole write cache
out.  And ext4 has adopted the same mechanism.  This is very much a good
thing from the perspective of database reliability for people with
regular hard drives who don't have a useful write cache on their cheap
hard drives.  It allows them to keep the disk's write cache on for other
things, while still getting the proper cache flushes when the database
commits demand them.  It does mean that everyone with a non-volatile
battery backed cache, via RAID card typically, needs to turn barriers
off manually.

I've already warned on this list that PostgreSQL commit performance on
ext4 is going to appear really terrible to many people.  If you
benchmark and don't recognize ext3 wasn't operating in a reliable mode
before, the performance drop now that ext4 is doing the right thing with
barriers looks impossibly bad.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: Weird XFS WAL problem

From

Greg Smith

Date:

03 June 2010, 15:27:50

Craig James wrote:
> This is really hard to believe, because the bonnie++ numbers and dd(1)
> numbers look good (see my original post).  But it's totally
> repeatable.  It must be some really unfortunate "just missed the next
> sector going by the write head" problem.

Commit performance is a separate number to measure that is not reflected
in any benchmark that tests sequential performance.  I consider it the
fourth axis of disk system performance (seq read, seq write, random
IOPS, commit rate), and directly measure it with the sysbench fsync test
I recommended already.  (You can do it with the right custom pgbench
script too).

You only get one commit per rotation on a drive, which is exactly what
you're seeing:  a bit under the 120 spins/second @ 7200 RPM.  Attempts
to time things just right to catch more than one sector per spin are
extremely difficult to accomplish, I spent a week on that once without
making any good progress.  You can easily get 100MB/s on reads and
writes but only manage 100 commits/second.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: Weird XFS WAL problem

From

"Kevin Grittner"

Date:

03 June 2010, 15:40:59

Greg Smith <greg@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>> I've seen this, too (with xfs).  Our RAID controller, in spite of
>> having BBU cache configured for writeback, waits for actual
>> persistence on disk for write barriers (unlike for fsync).  This
>> does strike me as surprising to the point of bordering on
>> qualifying as a bug.
> Completely intentional, and documented at
>
http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F

Yeah, I read that long ago and I've disabled write barriers because
of it; however, it still seems wrong that the RAID controller
insists on flushing to the drives in write-back mode.  Here are my
reasons for wishing it was otherwise:

(1)  We've had batteries on our RAID controllers fail occasionally.
The controller automatically degrades to write-through, and we get
an email from the server and schedule a tech to travel to the site
and replace the battery; but until we take action we are now exposed
to possible database corruption.  Barriers don't automatically come
on when the controller flips to write-through mode.

(2)  It precludes any possibility of moving from fsync techniques to
write barrier techniques for ensuring database integrity.  If the OS
respected write barriers and the controller considered the write
satisfied when it hit BBU cache, write barrier techniques would
work, and checkpoints could be made smoother.  Think how nicely that
would inter-operate with point (1).

So, while I understand it's Working As Designed, I think the design
is surprising and sub-optimal.

-Kevin

Re: Weird XFS WAL problem

From

Scott Marlowe

Date:

03 June 2010, 16:16:37

On Thu, Jun 3, 2010 at 12:40 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>
> Yeah, I read that long ago and I've disabled write barriers because
> of it; however, it still seems wrong that the RAID controller
> insists on flushing to the drives in write-back mode.  Here are my
> reasons for wishing it was otherwise:

I think it's a case of the quickest, simplest answer to semi-new tech.
 Not sure what to do with barriers?  Just flush the whole cache.

I'm guessing that this will get optimized in the future.

BTW, I'll have LSI Megaraid latest and greatest to test on in a month,
and older Areca 1680s as well. I'll be updating the firmware on the
arecas, and I'll run some tests on the whole barrier behaviour to see
if it's gotten any better lately.

Re: Weird XFS WAL problem

From

"Kevin Grittner"

Date:

03 June 2010, 16:17:43

Scott Marlowe <scott.marlowe@gmail.com> wrote:

> I think it's a case of the quickest, simplest answer to semi-new
> tech.  Not sure what to do with barriers?  Just flush the whole
> cache.
>
> I'm guessing that this will get optimized in the future.

Let's hope so.

That reminds me, the write barrier concept is at least on the
horizon as a viable technology; does anyone know if the asynchronous
graphs concept in this (one page) paper ever came to anything?  (I
haven't hear anything about it lately.)

http://www.usenix.org/events/fast05/wips/burnett.pdf

-Kevin

Re: Weird XFS WAL problem

From

Greg Smith

Date:

03 June 2010, 16:31:38

Scott Marlowe wrote:
> I think it's a case of the quickest, simplest answer to semi-new tech.
>  Not sure what to do with barriers?  Just flush the whole cache.
>

Well, that really is the only useful thing you can do with regular SATA
drives; the ATA command set isn't any finer grained than that in a way
that's useful for this context.  And it's also quite reasonable for a
RAID controller to respond to that "flush the whole cache" call by
flushing its cache.  So it's not just the simplest first answer, I
believe it's the only answer until a better ATA command set becomes
available.

I think this can only be resolved usefully for all of us at the RAID
firmware level.  If the controller had some logic that said "it's OK to
not flush the cache when that call comes in if my battery is working
fine", that would make this whole problem go away.  I don't expect it's
possible to work around the exact set of concerns Kevin listed any other
way, because as he pointed out the right thing to do is very dependent
on the battery health, which the OS also doesn't know (again, would
require some new command set verbage).

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: Weird XFS WAL problem

From

Scott Marlowe

Date:

03 June 2010, 16:44:32

On Thu, Jun 3, 2010 at 1:31 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Scott Marlowe wrote:
>>
>> I think it's a case of the quickest, simplest answer to semi-new tech.
>>  Not sure what to do with barriers?  Just flush the whole cache.
>>
>
> Well, that really is the only useful thing you can do with regular SATA
> drives; the ATA command set isn't any finer grained than that in a way
> that's useful for this context.  And it's also quite reasonable for a RAID
> controller to respond to that "flush the whole cache" call by flushing its
> cache.  So it's not just the simplest first answer, I believe it's the only
> answer until a better ATA command set becomes available.
>
> I think this can only be resolved usefully for all of us at the RAID
> firmware level.  If the controller had some logic that said "it's OK to not
> flush the cache when that call comes in if my battery is working fine",

That's what already happens for fsync on a BBU controller, so I don't
think the code to do so would be something fancy and new, just a
simple change of logic on which code path to take.

Re: Weird XFS WAL problem

From

"Kevin Grittner"

Date:

03 June 2010, 17:01:15

Greg Smith <greg@2ndquadrant.com> wrote:

> I think this can only be resolved usefully for all of us at the
> RAID firmware level.  If the controller had some logic that said
> "it's OK to not flush the cache when that call comes in if my
> battery is working fine", that would make this whole problem go
> away.

That is exactly what I've been trying to suggest.  Sorry for not
being more clear about it.

-Kevin

Re: Weird XFS WAL problem

From

Matthew Wakeling

Date:

04 June 2010, 06:27:18

On Thu, 3 Jun 2010, Greg Smith wrote:
> And it's also quite reasonable for a RAID controller to respond to that
> "flush the whole cache" call by flushing its cache.

Remember that the RAID controller is presenting itself to the OS as a
large disc, and hiding the individual discs from the OS. Why should the OS
care what has actually happened to the individual discs' caches, as long
as that "flush the whole cache" command guarantees that the data is
persistent. Taking the RAID array as a whole, that happens when the data
hits the write-back cache.

The only circumstance where you actually need to flush the data to the
individual discs is when you need to take that disc away somewhere else
and read it on another system. That's quite a rare use case for a RAID
array (http://thedailywtf.com/Articles/RAIDing_Disks.aspx
notwithstanding).

> If the controller had some logic that said "it's OK to not flush the
> cache when that call comes in if my battery is working fine", that would
> make this whole problem go away.

The only place this can be properly sorted is the RAID controller.
Anywhere else would be crazy.

Matthew

--
"To err is human; to really louse things up requires root
 privileges."                 -- Alexander Pope, slightly paraphrased

Re: Weird XFS WAL problem

From

Bruce Momjian

Date:

04 June 2010, 12:06:58

Greg Smith wrote:
> Kevin Grittner wrote:
> > I've seen this, too (with xfs).  Our RAID controller, in spite of
> > having BBU cache configured for writeback, waits for actual
> > persistence on disk for write barriers (unlike for fsync).  This
> > does strike me as surprising to the point of bordering on qualifying
> > as a bug.
> Completely intentional, and documented at
> http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F
>
> The issue is that XFS will actually send the full "flush your cache"
> call to the controller, rather than just the usual fsync call, and that
> eliminates the benefit of having a write cache there in the first
> place.  Good controllers respect that and flush their whole write cache
> out.  And ext4 has adopted the same mechanism.  This is very much a good
> thing from the perspective of database reliability for people with
> regular hard drives who don't have a useful write cache on their cheap
> hard drives.  It allows them to keep the disk's write cache on for other
> things, while still getting the proper cache flushes when the database
> commits demand them.  It does mean that everyone with a non-volatile
> battery backed cache, via RAID card typically, needs to turn barriers
> off manually.
>
> I've already warned on this list that PostgreSQL commit performance on
> ext4 is going to appear really terrible to many people.  If you
> benchmark and don't recognize ext3 wasn't operating in a reliable mode
> before, the performance drop now that ext4 is doing the right thing with
> barriers looks impossibly bad.

Well, this is depressing.  Now that we finally have common
battery-backed cache RAID controller cards, the file system developers
have throw down another roadblock in ext4 and xfs.  Do we need to
document this?

On another topic, I am a little unclear on how things behave when the
drive is write-back. If the RAID controller card writes to the drive,
but the data isn't on the platers, how does it know when it can discard
that information from the BBU RAID cache?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +

Re: Weird XFS WAL problem

From

"Kevin Grittner"

Date:

04 June 2010, 12:16:59

Bruce Momjian <bruce@momjian.us> wrote:

> On another topic, I am a little unclear on how things behave when
> the drive is write-back. If the RAID controller card writes to the
> drive, but the data isn't on the platers, how does it know when it
> can discard that information from the BBU RAID cache?

The controller waits for the drive to tell it that it has made it to
the platter before it discards it.  What made you think otherwise?

-Kevin

Re: Weird XFS WAL problem

From

Bruce Momjian

Date:

04 June 2010, 12:18:23

Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
> > On another topic, I am a little unclear on how things behave when
> > the drive is write-back. If the RAID controller card writes to the
> > drive, but the data isn't on the platers, how does it know when it
> > can discard that information from the BBU RAID cache?
>
> The controller waits for the drive to tell it that it has made it to
> the platter before it discards it.  What made you think otherwise?

Because a write-back drive cache says it is on the drive before it hits
the platters, which I think is the default for SATA drive.  Is that
inaccurate?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +

Re: Weird XFS WAL problem

From

"Kevin Grittner"

Date:

04 June 2010, 12:24:07

Bruce Momjian <bruce@momjian.us> wrote:
> Kevin Grittner wrote:

>> The controller waits for the drive to tell it that it has made it
>> to the platter before it discards it.  What made you think
>> otherwise?
>
> Because a write-back drive cache says it is on the drive before it
> hits the platters, which I think is the default for SATA drive.
> Is that inaccurate?

Any decent RAID controller will ensure that the drives themselves
aren't using write-back caching.  When we've mentioned write-back
versus write-through on this thread we've been talking about the
behavior of the *controller*.  We have our controllers configured to
use write-back through the BBU cache as long as the battery is good,
but to automatically switch to write-through if the battery goes
bad.

-Kevin

Re: Weird XFS WAL problem

From

Bruce Momjian

Date:

04 June 2010, 12:30:29

Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
> > Kevin Grittner wrote:
>
> >> The controller waits for the drive to tell it that it has made it
> >> to the platter before it discards it.  What made you think
> >> otherwise?
> >
> > Because a write-back drive cache says it is on the drive before it
> > hits the platters, which I think is the default for SATA drive.
> > Is that inaccurate?
>
> Any decent RAID controller will ensure that the drives themselves
> aren't using write-back caching.  When we've mentioned write-back
> versus write-through on this thread we've been talking about the
> behavior of the *controller*.  We have our controllers configured to
> use write-back through the BBU cache as long as the battery is good,
> but to automatically switch to write-through if the battery goes
> bad.

OK, good, but when why would a BBU RAID controller flush stuff to disk
with a flush-all command?  I thought the whole goal of BBU was to avoid
such flushes.  What is unique about the command ext4/xfs is sending?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +

Re: Weird XFS WAL problem

From

"Kevin Grittner"

Date:

04 June 2010, 12:36:10

Bruce Momjian <bruce@momjian.us> wrote:
> Kevin Grittner wrote:

>> Any decent RAID controller will ensure that the drives themselves
>> aren't using write-back caching.  When we've mentioned write-back
>> versus write-through on this thread we've been talking about the
>> behavior of the *controller*.  We have our controllers configured
>> to use write-back through the BBU cache as long as the battery is
>> good, but to automatically switch to write-through if the battery
>> goes bad.
>
> OK, good, but when why would a BBU RAID controller flush stuff to
> disk with a flush-all command?  I thought the whole goal of BBU
> was to avoid such flushes.

That has been *precisely* my point.

I don't know at the protocol level; I just know that write barriers
do *something* which causes our controllers to wait for actual disk
platter persistence, while fsync does not.

The write barrier concept seems good to me, and I wish it could be
used at the OS level without killing performance.  I blame the
controller, for not treating it the same as fsync (i.e., as long as
it's in write-back mode it should treat data as persisted as soon as
it's in BBU cache).

-Kevin

Re: Weird XFS WAL problem

From

Bruce Momjian

Date:

04 June 2010, 12:42:00

Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
> > Kevin Grittner wrote:
>
> >> Any decent RAID controller will ensure that the drives themselves
> >> aren't using write-back caching.  When we've mentioned write-back
> >> versus write-through on this thread we've been talking about the
> >> behavior of the *controller*.  We have our controllers configured
> >> to use write-back through the BBU cache as long as the battery is
> >> good, but to automatically switch to write-through if the battery
> >> goes bad.
> >
> > OK, good, but when why would a BBU RAID controller flush stuff to
> > disk with a flush-all command?  I thought the whole goal of BBU
> > was to avoid such flushes.
>
> That has been *precisely* my point.
>
> I don't know at the protocol level; I just know that write barriers
> do *something* which causes our controllers to wait for actual disk
> platter persistence, while fsync does not.
>
> The write barrier concept seems good to me, and I wish it could be
> used at the OS level without killing performance.  I blame the
> controller, for not treating it the same as fsync (i.e., as long as
> it's in write-back mode it should treat data as persisted as soon as
> it's in BBU cache).

Yeah.  I wonder if it honors the cache flush because it might think it
is replacing disks or something odd.  I think we are going to have to
document this in 9.0 because obviously you have seen it already.

Is this an issue with SAS cards/drives as well?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +

Re: Weird XFS WAL problem

From

Greg Smith

Date:

05 June 2010, 19:50:51

Kevin Grittner wrote:
> I don't know at the protocol level; I just know that write barriers
> do *something* which causes our controllers to wait for actual disk
> platter persistence, while fsync does not

It's in the docs now:
http://www.postgresql.org/docs/9.0/static/wal-reliability.html

FLUSH CACHE EXT is the ATAPI-6 call that filesystems use to enforce
barriers on that type of drive.  Here's what the relevant portion of the
ATAPI spec says:

"This command is used by the host to request the device to flush the
write cache. If there is data in the write
cache, that data shall be written to the media.The BSY bit shall remain
set to one until all data has been
successfully written or an error occurs."

SAS systems have a similar call named SYNCHRONIZE CACHE.

The improvement I actually expect to arrive here first is a reliable
implementation of O_SYNC/O_DSYNC writes.  Both SAS and SATA drives that
capable of doing Native Command Queueing support a write type called
"Force Unit Access", which is essentially just like a direct write that
cannot be cached.  When we get more kernels with reliable sync writing
that maps under the hood to FUA, and can change wal_sync_method to use
them, the need to constantly call fsync for every write to the WAL will
go away.  Then the "blow out the RAID cache when barriers are on"
behavior will only show up during checkpoint fsyncs, which will make
things a lot better (albeit still not ideal).

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: Weird XFS WAL problem

From

Bruce Momjian

Date:

07 July 2010, 11:43:22

Greg Smith wrote:
> Kevin Grittner wrote:
> > I don't know at the protocol level; I just know that write barriers
> > do *something* which causes our controllers to wait for actual disk
> > platter persistence, while fsync does not
>
> It's in the docs now:
> http://www.postgresql.org/docs/9.0/static/wal-reliability.html
>
> FLUSH CACHE EXT is the ATAPI-6 call that filesystems use to enforce
> barriers on that type of drive.  Here's what the relevant portion of the
> ATAPI spec says:
>
> "This command is used by the host to request the device to flush the
> write cache. If there is data in the write
> cache, that data shall be written to the media.The BSY bit shall remain
> set to one until all data has been
> successfully written or an error occurs."
>
> SAS systems have a similar call named SYNCHRONIZE CACHE.
>
> The improvement I actually expect to arrive here first is a reliable
> implementation of O_SYNC/O_DSYNC writes.  Both SAS and SATA drives that
> capable of doing Native Command Queueing support a write type called
> "Force Unit Access", which is essentially just like a direct write that
> cannot be cached.  When we get more kernels with reliable sync writing
> that maps under the hood to FUA, and can change wal_sync_method to use
> them, the need to constantly call fsync for every write to the WAL will
> go away.  Then the "blow out the RAID cache when barriers are on"
> behavior will only show up during checkpoint fsyncs, which will make
> things a lot better (albeit still not ideal).

Great information!  I have added the attached documentation patch to
explain the write-barrier/BBU interaction.  This will appear in the 9.0
documentation.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.66
diff -c -c -r1.66 wal.sgml
*** doc/src/sgml/wal.sgml    13 Apr 2010 14:15:25 -0000    1.66
--- doc/src/sgml/wal.sgml    7 Jul 2010 13:55:58 -0000
***************
*** 48,68 ****
     some later time. Such caches can be a reliability hazard because the
     memory in the disk controller cache is volatile, and will lose its
     contents in a power failure.  Better controller cards have
!    <firstterm>battery-backed</> caches, meaning the card has a battery that
     maintains power to the cache in case of system power loss.  After power
     is restored the data will be written to the disk drives.
    </para>

    <para>
     And finally, most disk drives have caches. Some are write-through
!    while some are write-back, and the
!    same concerns about data loss exist for write-back drive caches as
!    exist for disk controller caches.  Consumer-grade IDE and SATA drives are
!    particularly likely to have write-back caches that will not survive a
!    power failure, though <acronym>ATAPI-6</> introduced a drive cache
!    flush command (FLUSH CACHE EXT) that some file systems use, e.g. <acronym>ZFS</>.
!    Many solid-state drives (SSD) also have volatile write-back
!    caches, and many do not honor cache flush commands by default.
     To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
--- 48,74 ----
     some later time. Such caches can be a reliability hazard because the
     memory in the disk controller cache is volatile, and will lose its
     contents in a power failure.  Better controller cards have
!    <firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
!    the card has a battery that
     maintains power to the cache in case of system power loss.  After power
     is restored the data will be written to the disk drives.
    </para>

    <para>
     And finally, most disk drives have caches. Some are write-through
!    while some are write-back, and the same concerns about data loss
!    exist for write-back drive caches as exist for disk controller
!    caches.  Consumer-grade IDE and SATA drives are particularly likely
!    to have write-back caches that will not survive a power failure,
!    though <acronym>ATAPI-6</> introduced a drive cache flush command
!    (<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
!    <acronym>ZFS</>, <acronym>ext4</>.  (The SCSI command
!    <command>SYNCHRONIZE CACHE</> has long been available.) Many
!    solid-state drives (SSD) also have volatile write-back caches, and
!    many do not honor cache flush commands by default.
!   </para>
!
!   <para>
     To check write caching on <productname>Linux</> use
     <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
     to <literal>Write cache</>; <command>hdparm -W</> to turn off
***************
*** 83,88 ****
--- 89,113 ----
    </para>

    <para>
+    Many file systems that use write barriers (e.g.  <acronym>ZFS</>,
+    <acronym>ext4</>) internally use <command>FLUSH CACHE EXT</> or
+    <command>SYNCHRONIZE CACHE</> commands to flush data to the platers on
+    write-back-enabled drives.  Unfortunately, such write barrier file
+    systems behave suboptimally when combined with battery-backed unit
+    (<acronym>BBU</>) disk controllers.  In such setups, the synchronize
+    command forces all data from the BBU to the disks, eliminating much
+    of the benefit of the BBU.  You can run the utility
+    <filename>src/tools/fsync</> in the PostgreSQL source tree to see
+    if you are effected.  If you are effected, the performance benefits
+    of the BBU cache can be regained by turning off write barriers in
+    the file system or reconfiguring the disk controller, if that is
+    an option.  If write barriers are turned off, make sure the battery
+    remains active; a faulty battery can potentially lead to data loss.
+    Hopefully file system and disk controller designers will eventually
+    address this suboptimal behavior.
+   </para>
+
+   <para>
     When the operating system sends a write request to the storage hardware,
     there is little it can do to make sure the data has arrived at a truly
     non-volatile storage area. Rather, it is the