Thread: Air-traffic benchmark

From: Lefteris
Date:
Hi all,

following the simple but interesting air-traffic benchmark published at:
http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/

I decided to run the benchmark on postgres to get some more
experience and insights. Unfortunately, the query times I got from
postgres were not the expected ones: the fast queries were in the
order of 5 minutes, while the slow ones were in the order of 40
minutes. I came to the conclusion that some parameters are not set
correctly. I give all the information I can think of in the rest of
this email, in the hope of getting some useful suggestions to improve
the query times. Thank you very much for your time!

All of the schema, data and query information can be found on the
original blog post given above.

The machine runs Fedora 10:
uname -a:
Linux 2.6.27.30-170.2.82.fc10.x86_64 #1 SMP Mon Aug 17 08:18:34 EDT
2009 x86_64 x86_64 x86_64 GNU/Linux

The hardware characteristics are:
Platform Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz with 8GB RAM and
ample disk space (2x 500 GB SATA disk @ 7200 RPM as SW-RAID-0)

I used postgresql-8.4.2, downloaded as source and compiled by hand; no
special parameters were passed at the configure phase.

-- DETAILS on loading, analyze and index creation on postgresql-8.4.2
-- loading time was 77m15.976s
-- total rows: 119790558
-- total size of pgdata/base: 46G
ANALYZE;
-- 219698.365ms aka 3m39.698s
CREATE INDEX year_month_idx ON ontime ("Year", "Month");
-- 517481.632ms aka 8m37.481s
CREATE INDEX DepDelay_idx ON ontime ("DepDelay");
-- 1257051.668ms aka 20m57.051s

Query 1:

SELECT "DayOfWeek", count(*) AS c FROM ontime WHERE "Year" BETWEEN
2000 AND 2008 GROUP BY "DayOfWeek" ORDER BY c DESC;

Reported query times are (in sec):
MonetDB 7.9s
InfoBright 12.13s
LucidDB 54.8s

For pg-8.4.2 I got the following times from 3 consecutive runs on the server:
5m52.384s
5m55.885s
5m54.309s

For more complex queries, postgres becomes even slower: queries Q2 to
Q8 need over 40 minutes, while the rest take around 5 minutes.
Examining the EXPLAIN ANALYZE output of these queries, the main
slowdown factors are the HashAggregate and especially the Sort
operations. For query 1, EXPLAIN ANALYZE returns:

airtraffic=# EXPLAIN ANALYZE SELECT "DayOfWeek", count(*) AS c FROM
ontime WHERE "Year" BETWEEN 2000 AND 2008 GROUP BY "DayOfWeek" ORDER
BY c DESC;
                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7407754.12..7407754.13 rows=4 width=2) (actual time=371188.821..371188.823 rows=7 loops=1)
   Sort Key: (count(*))
   Sort Method:  quicksort  Memory: 25kB
   ->  HashAggregate  (cost=7407754.04..7407754.08 rows=4 width=2) (actual time=371163.316..371163.320 rows=7 loops=1)
         ->  Seq Scan on ontime  (cost=0.00..7143875.40 rows=52775727 width=2) (actual time=190938.959..346180.079 rows=52484047 loops=1)
               Filter: (("Year" >= 2000) AND ("Year" <= 2008))
 Total runtime: 371201.156 ms
(7 rows)


I understand that the problem here is the memory used by the sort
method. *But*, I already changed the work_mem parameter to 6gigs:)

airtraffic=# show work_mem;
 work_mem
----------
 6GB
(1 row)

And when I saw that this did not work, I also changed the 8.3-style
(?) parameter sort_mem:
airtraffic=# show sort_mem;
 work_mem
----------
 6GB
(1 row)

which of course means that work_mem == sort_mem. As such, shouldn't
the sort algorithm have used much more memory?
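
As a side note, work_mem can also be changed per session, which makes it easy to experiment with different values without touching postgresql.conf. A sketch (the value here is only an example, not a recommendation):

```sql
-- Set work_mem for this session only; the value is illustrative.
SET work_mem = '256MB';
SHOW work_mem;   -- confirm the session value

-- Re-run the query under the new setting to compare plans/timings.
EXPLAIN ANALYZE
SELECT "DayOfWeek", count(*) AS c
FROM ontime
WHERE "Year" BETWEEN 2000 AND 2008
GROUP BY "DayOfWeek"
ORDER BY c DESC;
```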

I also attach the output of 'show all;' so you can advise me on any
other configuration settings that I might need to change to perform
better.


Thank you very much for your feedback and all of your help!

lefteris sidirourgos

P.S. sorry if you get this message twice but the first one does not
appear to have reached the list



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
airtraffic=# show all;

              name               |                         setting
                    |
        description

---------------------------------+---------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------
 add_missing_from                | off
                    | Automatically adds missing table references to
FROM clauses.
 allow_system_table_mods         | off
                    | Allows modifications of the structure of system
tables.
 archive_command                 | (disabled)
                    | Sets the shell command that will be called to
archive a WAL file.
 archive_mode                    | off
                    | Allows archiving of WAL files using
archive_command.
 archive_timeout                 | 0
                    | Forces a switch to the next xlog file if a new
file has not been started within N seconds.
 array_nulls                     | on
                    | Enable input of NULL elements in arrays.
 authentication_timeout          | 1min
                    | Sets the maximum allowed time to complete client
authentication.
 autovacuum                      | on
                    | Starts the autovacuum subprocess.
 autovacuum_analyze_scale_factor | 0.1
                    | Number of tuple inserts, updates or deletes
prior to analyze as a fraction of reltuples.
 autovacuum_analyze_threshold    | 50
                    | Minimum number of tuple inserts, updates or
deletes prior to analyze.
 autovacuum_freeze_max_age       | 200000000
                    | Age at which to autovacuum a table to prevent
transaction ID wraparound.
 autovacuum_max_workers          | 3
                    | Sets the maximum number of simultaneously
running autovacuum worker processes.
 autovacuum_naptime              | 1min
                    | Time to sleep between autovacuum runs.
 autovacuum_vacuum_cost_delay    | 20ms
                    | Vacuum cost delay in milliseconds, for
autovacuum.
 autovacuum_vacuum_cost_limit    | -1
                    | Vacuum cost amount available before napping, for
autovacuum.
 autovacuum_vacuum_scale_factor  | 0.2
                    | Number of tuple updates or deletes prior to
vacuum as a fraction of reltuples.
 autovacuum_vacuum_threshold     | 50
                    | Minimum number of tuple updates or deletes prior
to vacuum.
 backslash_quote                 | safe_encoding
                    | Sets whether "\'" is allowed in string literals.
 bgwriter_delay                  | 200ms
                    | Background writer sleep time between rounds.
 bgwriter_lru_maxpages           | 100
                    | Background writer maximum number of LRU pages to
flush per round.
 bgwriter_lru_multiplier         | 2
                    | Multiple of the average buffer usage to free per
round.
 block_size                      | 8192
                    | Shows the size of a disk block.
 bonjour_name                    |
                    | Sets the Bonjour broadcast service name.
 check_function_bodies           | on
                    | Check function bodies during CREATE FUNCTION.
 checkpoint_completion_target    | 0.5
                    | Time spent flushing dirty buffers during
checkpoint, as fraction of checkpoint interval.
 checkpoint_segments             | 30
                    | Sets the maximum distance in log segments
between automatic WAL checkpoints.
 checkpoint_timeout              | 5min
                    | Sets the maximum time between automatic WAL
checkpoints.
 checkpoint_warning              | 30s
                    | Enables warnings if checkpoint segments are
filled more frequently than this.
 client_encoding                 | UTF8
                    | Sets the client's character set encoding.
 client_min_messages             | notice
                    | Sets the message levels that are sent to the
client.
 commit_delay                    | 0
                    | Sets the delay in microseconds between
transaction commit and flushing WAL to disk.
 commit_siblings                 | 5
                    | Sets the minimum concurrent open transactions
before performing commit_delay.
 config_file                     |
/export/scratch0/lsidir/postgres/pgdata/postgresql.conf | Sets the
server's main configuration file.
 constraint_exclusion            | partition
                    | Enables the planner to use constraints to
optimize queries.
 cpu_index_tuple_cost            | 0.005
                    | Sets the planner's estimate of the cost of
processing each index entry during an index scan.
 cpu_operator_cost               | 0.0025
                    | Sets the planner's estimate of the cost of
processing each operator or function call.
 cpu_tuple_cost                  | 0.01
                    | Sets the planner's estimate of the cost of
processing each tuple (row).
 cursor_tuple_fraction           | 0.1
                    | Sets the planner's estimate of the fraction of a
cursor's rows that will be retrieved.
 custom_variable_classes         |
                    | Sets the list of known custom variable classes.
 data_directory                  |
/export/scratch0/lsidir/postgres/pgdata                 | Sets the
server's data directory.
 DateStyle                       | ISO, DMY
                    | Sets the display format for date and time
values.
 db_user_namespace               | off
                    | Enables per-database user names.
 deadlock_timeout                | 1s
                    | Sets the time to wait on a lock before checking
for deadlock.
 debug_assertions                | off
                    | Turns on various assertion checks.
 debug_pretty_print              | on
                    | Indents parse and plan tree displays.
 debug_print_parse               | off
                    | Logs each query's parse tree.
 debug_print_plan                | off
                    | Logs each query's execution plan.
 debug_print_rewritten           | off
                    | Logs each query's rewritten parse tree.
 default_statistics_target       | 100
                    | Sets the default statistics target.
 default_tablespace              |
                    | Sets the default tablespace to create tables and
indexes in.
 default_text_search_config      | pg_catalog.english
                    | Sets default text search configuration.
 default_transaction_isolation   | read committed
                    | Sets the transaction isolation level of each new
transaction.
 default_transaction_read_only   | off
                    | Sets the default read-only status of new
transactions.
 default_with_oids               | off
                    | Create new tables with OIDs by default.
 dynamic_library_path            | $libdir
                    | Sets the path for dynamically loadable modules.
 effective_cache_size            | 8MB
                    | Sets the planner's assumption about the size of
the disk cache.
 effective_io_concurrency        | 1
                    | Number of simultaneous requests that can be
handled efficiently by the disk subsystem.
 enable_bitmapscan               | on
                    | Enables the planner's use of bitmap-scan plans.
 enable_hashagg                  | on
                    | Enables the planner's use of hashed aggregation
plans.
 enable_hashjoin                 | on
                    | Enables the planner's use of hash join plans.
 enable_indexscan                | on
                    | Enables the planner's use of index-scan plans.
 enable_mergejoin                | on
                    | Enables the planner's use of merge join plans.
 enable_nestloop                 | on
                    | Enables the planner's use of nested-loop join
plans.
 enable_seqscan                  | on
                    | Enables the planner's use of sequential-scan
plans.
 enable_sort                     | on
                    | Enables the planner's use of explicit sort
steps.
 enable_tidscan                  | on
                    | Enables the planner's use of TID scan plans.
 escape_string_warning           | on
                    | Warn about backslash escapes in ordinary string
literals.
 external_pid_file               |
                    | Writes the postmaster PID to the specified file.
 extra_float_digits              | 0
                    | Sets the number of digits displayed for
floating-point values.
 from_collapse_limit             | 8
                    | Sets the FROM-list size beyond which subqueries
are not collapsed.
 fsync                           | on
                    | Forces synchronization of updates to disk.
 full_page_writes                | on
                    | Writes full pages to WAL when first modified
after a checkpoint.
 geqo                            | on
                    | Enables genetic query optimization.
 geqo_effort                     | 5
                    | GEQO: effort is used to set the default for
other GEQO parameters.
 geqo_generations                | 0
                    | GEQO: number of iterations of the algorithm.
 geqo_pool_size                  | 0
                    | GEQO: number of individuals in the population.
 geqo_selection_bias             | 2
                    | GEQO: selective pressure within the population.
 geqo_threshold                  | 12
                    | Sets the threshold of FROM items beyond which
GEQO is used.
 gin_fuzzy_search_limit          | 0
                    | Sets the maximum allowed result for exact search
by GIN.
 hba_file                        |
/export/scratch0/lsidir/postgres/pgdata/pg_hba.conf     | Sets the
server's "hba" configuration file.
 ident_file                      |
/export/scratch0/lsidir/postgres/pgdata/pg_ident.conf   | Sets the
server's "ident" configuration file.
 ignore_system_indexes           | off
                    | Disables reading from system indexes.
 integer_datetimes               | on
                    | Datetimes are integer based.
 IntervalStyle                   | postgres
                    | Sets the display format for interval values.
 join_collapse_limit             | 8
                    | Sets the FROM-list size beyond which JOIN
constructs are not flattened.
 krb_caseins_users               | off
                    | Sets whether Kerberos and GSSAPI user names
should be treated as case-insensitive.
 krb_server_keyfile              |
                    | Sets the location of the Kerberos server key
file.
 krb_srvname                     | postgres
                    | Sets the name of the Kerberos service.
 lc_collate                      | en_GB.UTF-8
                    | Shows the collation order locale.
 lc_ctype                        | en_GB.UTF-8
                    | Shows the character classification and case
conversion locale.
 lc_messages                     | en_GB.UTF-8
                    | Sets the language in which messages are
displayed.
 lc_monetary                     | en_GB.UTF-8
                    | Sets the locale for formatting monetary amounts.
 lc_numeric                      | en_GB.UTF-8
                    | Sets the locale for formatting numbers.
 lc_time                         | en_GB.UTF-8
                    | Sets the locale for formatting date and time
values.
 listen_addresses                | localhost
                    | Sets the host name or IP address(es) to listen
to.
 local_preload_libraries         |
                    | Lists shared libraries to preload into each
backend.
 log_autovacuum_min_duration     | -1
                    | Sets the minimum execution time above which
autovacuum actions will be logged.
 log_checkpoints                 | off
                    | Logs each checkpoint.
 log_connections                 | off
                    | Logs each successful connection.
 log_destination                 | stderr
                    | Sets the destination for server log output.
 log_directory                   | pg_log
                    | Sets the destination directory for log files.
 log_disconnections              | off
                    | Logs end of a session, including duration.
 log_duration                    | off
                    | Logs the duration of each completed SQL
statement.
 log_error_verbosity             | default
                    | Sets the verbosity of logged messages.
 log_executor_stats              | off
                    | Writes executor performance statistics to the
server log.
 log_filename                    | postgresql-%Y-%m-%d_%H%M%S.log
                    | Sets the file name pattern for log files.
 log_hostname                    | off
                    | Logs the host name in the connection logs.
 log_line_prefix                 |
                    | Controls information prefixed to each log line.
 log_lock_waits                  | off
                    | Logs long lock waits.
 log_min_duration_statement      | -1
                    | Sets the minimum execution time above which
statements will be logged.
 log_min_error_statement         | error
                    | Causes all statements generating error at or
above this level to be logged.
 log_min_messages                | warning
                    | Sets the message levels that are logged.
 log_parser_stats                | off
                    | Writes parser performance statistics to the
server log.
 log_planner_stats               | off
                    | Writes planner performance statistics to the
server log.
 log_rotation_age                | 1d
                    | Automatic log file rotation will occur after N
minutes.
 log_rotation_size               | 10MB
                    | Automatic log file rotation will occur after N
kilobytes.
 log_statement                   | none
                    | Sets the type of statements logged.
 log_statement_stats             | off
                    | Writes cumulative performance statistics to the
server log.
 log_temp_files                  | -1
                    | Log the use of temporary files larger than this
number of kilobytes.
 log_timezone                    | Europe/Amsterdam
                    | Sets the time zone to use in log messages.
 log_truncate_on_rotation        | off
                    | Truncate existing log files of same name during
log rotation.
 logging_collector               | off
                    | Start a subprocess to capture stderr output
and/or csvlogs into log files.
 maintenance_work_mem            | 2GB
                    | Sets the maximum memory to be used for
maintenance operations.
 max_connections                 | 100
                    | Sets the maximum number of concurrent
connections.
 max_files_per_process           | 1000
                    | Sets the maximum number of simultaneously open
files for each server process.
 max_function_args               | 100
                    | Shows the maximum number of function arguments.
 max_identifier_length           | 63
                    | Shows the maximum identifier length.
 max_index_keys                  | 32
                    | Shows the maximum number of index keys.
 max_locks_per_transaction       | 64
                    | Sets the maximum number of locks per
transaction.
 max_prepared_transactions       | 0
                    | Sets the maximum number of simultaneously
prepared transactions.
 max_stack_depth                 | 2MB
                    | Sets the maximum stack depth, in kilobytes.
 password_encryption             | on
                    | Encrypt passwords.
 port                            | 5432
                    | Sets the TCP port the server listens on.
 post_auth_delay                 | 0
                    | Waits N seconds on connection startup after
authentication.
 pre_auth_delay                  | 0
                    | Waits N seconds on connection startup before
authentication.
 random_page_cost                | 4
                    | Sets the planner's estimate of the cost of a
nonsequentially fetched disk page.
 regex_flavor                    | advanced
                    | Sets the regular expression "flavor".
 search_path                     | "$user",public
                    | Sets the schema search order for names that are
not schema-qualified.
 segment_size                    | 1GB
                    | Shows the number of pages per disk file.
 seq_page_cost                   | 1
                    | Sets the planner's estimate of the cost of a
sequentially fetched disk page.
 server_encoding                 | UTF8
                    | Sets the server (database) character set
encoding.
 server_version                  | 8.4.2
                    | Shows the server version.
 server_version_num              | 80402
                    | Shows the server version as an integer.
 session_replication_role        | origin
                    | Sets the session's behavior for triggers and
rewrite rules.
 shared_buffers                  | 32MB
                    | Sets the number of shared memory buffers used by
the server.
 shared_preload_libraries        |
                    | Lists shared libraries to preload into server.
 silent_mode                     | off
                    | Runs the server silently.
 sql_inheritance                 | on
                    | Causes subtables to be included by default in
various commands.
 ssl                             | off
                    | Enables SSL connections.
 standard_conforming_strings     | off
                    | Causes '...' strings to treat backslashes
literally.
 statement_timeout               | 0
                    | Sets the maximum allowed duration of any
statement.
 stats_temp_directory            | pg_stat_tmp
                    | Writes temporary statistics files to the
specified directory.
 superuser_reserved_connections  | 3
                    | Sets the number of connection slots reserved for
superusers.
 synchronize_seqscans            | on
                    | Enable synchronized sequential scans.
 synchronous_commit              | on
                    | Sets immediate fsync at commit.
 syslog_facility                 | local0
                    | Sets the syslog "facility" to be used when
syslog enabled.
 syslog_ident                    | postgres
                    | Sets the program name used to identify
PostgreSQL messages in syslog.
 tcp_keepalives_count            | 0
                    | Maximum number of TCP keepalive retransmits.
 tcp_keepalives_idle             | 0
                    | Time between issuing TCP keepalives.
 tcp_keepalives_interval         | 0
                    | Time between TCP keepalive retransmits.
 temp_buffers                    | 1024
                    | Sets the maximum number of temporary buffers
used by each session.
 temp_tablespaces                |
                    | Sets the tablespace(s) to use for temporary
tables and sort files.
 TimeZone                        | Europe/Amsterdam
                    | Sets the time zone for displaying and
interpreting time stamps.
 timezone_abbreviations          | Default
                    | Selects a file of time zone abbreviations.
 trace_notify                    | off
                    | Generates debugging output for LISTEN and
NOTIFY.
 trace_sort                      | off
                    | Emit information about resource usage in
sorting.
 track_activities                | on
                    | Collects information about executing commands.
 track_activity_query_size       | 1024
                    | Sets the size reserved for
pg_stat_activity.current_query, in bytes.
 track_counts                    | on
                    | Collects statistics on database activity.
 track_functions                 | none
                    | Collects function-level statistics on database
activity.
 transaction_isolation           | read committed
                    | Sets the current transaction's isolation level.
 transaction_read_only           | off
                    | Sets the current transaction's read-only status.
 transform_null_equals           | off
                    | Treats "expr=NULL" as "expr IS NULL".
 unix_socket_directory           |
                    | Sets the directory where the Unix-domain socket
will be created.
 unix_socket_group               |
                    | Sets the owning group of the Unix-domain socket.
 unix_socket_permissions         | 511
                    | Sets the access permissions of the Unix-domain
socket.
 update_process_title            | on
                    | Updates the process title to show the active SQL
command.
 vacuum_cost_delay               | 0
                    | Vacuum cost delay in milliseconds.
 vacuum_cost_limit               | 200
                    | Vacuum cost amount available before napping.
 vacuum_cost_page_dirty          | 20
                    | Vacuum cost for a page dirtied by vacuum.
 vacuum_cost_page_hit            | 1
                    | Vacuum cost for a page found in the buffer
cache.
 vacuum_cost_page_miss           | 10
                    | Vacuum cost for a page not found in the buffer
cache.
 vacuum_freeze_min_age           | 50000000
                    | Minimum age at which VACUUM should freeze a
table row.
 vacuum_freeze_table_age         | 150000000
                    | Age at which VACUUM should scan whole table to
freeze tuples.
 wal_block_size                  | 8192
                    | Shows the block size in the write ahead log.
 wal_buffers                     | 64kB
                    | Sets the number of disk-page buffers in shared
memory for WAL.
 wal_segment_size                | 16MB
                    | Shows the number of pages per write ahead log
segment.
 wal_sync_method                 | fdatasync
                    | Selects the method used for forcing WAL updates
to disk.
 wal_writer_delay                | 200ms
                    | WAL writer sleep time between WAL flushes.
 work_mem                        | 6GB
                    | Sets the maximum memory to be used for query
workspaces.
 xmlbinary                       | base64
                    | Sets how binary values are to be encoded in XML.
 xmloption                       | content
                    | Sets whether XML data in implicit parsing and
serialization operations is to be considered as documents or content
fragments.
 zero_damaged_pages              | off
                    | Continues processing past damaged page headers.
(193 rows)

Re: Air-traffic benchmark

From: "Gurgel, Flavio"
Date:
Hello

----- "Lefteris" <lsidir@gmail.com> wrote:
> Hi all,
>
> following the simple but interesting air-traffic benchmark published
> at:
> http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/

Quite interesting test, if you have the time to download all that raw data.

> of 5 minutes while the slow ones in the order of 40 minutes. I came
> to
> the conclusion that some parameters are not set correctly. I give all

I do think so too.

> The hardware characteristics are:
> Platform Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz with 8GB RAM and
> ample disk space (2x 500 GB SATA disk @ 7200 RPM as SW-RAID-0)

SATA disks are not the best for a "benchmark" like yours, but the results were so far from the expected ones that
I'm sure you have hardware enough.

> I used  postgresql-8.4.2. Source download and compiled by hand, no
> special parameters where passed to the configure phase.
>
> -- DETAILS on loading, analyze and index creation on postgresql-8.4.2
> -- loading time was 77m15.976s
> -- total rows: 119790558
> -- total size of pgdata/base: 46G
> ANALYZE;
> -- 219698.365ms aka 3m39.698s
> CREATE INDEX year_month_idx ON ontime ("Year", "Month");
> -- 517481.632ms aka 8m37.481s
> CREATE INDEX DepDelay_idx ON ontime ("DepDelay");
> -- 1257051.668ms aka 20m57.051s

Here comes my first question:
Did you ANALYZE your database (or at least the related tables) _after_ index creation?
If not, try to do so. PostgreSQL needs up-to-date statistics once everything is in place.

> airtraffic=# EXPLAIN ANALYZE SELECT "DayOfWeek", count(*) AS c FROM
> ontime WHERE "Year" BETWEEN 2000 AND 2008 GROUP BY "DayOfWeek" ORDER
> BY c DESC;
>                                           QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=7407754.12..7407754.13 rows=4 width=2) (actual time=371188.821..371188.823 rows=7 loops=1)
>    Sort Key: (count(*))
>    Sort Method:  quicksort  Memory: 25kB
>    ->  HashAggregate  (cost=7407754.04..7407754.08 rows=4 width=2) (actual time=371163.316..371163.320 rows=7 loops=1)
>          ->  Seq Scan on ontime  (cost=0.00..7143875.40 rows=52775727 width=2) (actual time=190938.959..346180.079 rows=52484047 loops=1)
>                Filter: (("Year" >= 2000) AND ("Year" <= 2008))
>  Total runtime: 371201.156 ms
> (7 rows)

You'll see here that PostgreSQL is not using the index you just created.
ANALYZE VERBOSE ontime;
should solve this.
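
The whole post-load sequence would then look roughly like this (a sketch, reusing the index definitions from the original mail; adjust to your schema):

```sql
-- After bulk loading, build indexes first, then collect statistics,
-- so the planner sees the final state of the table.
CREATE INDEX year_month_idx ON ontime ("Year", "Month");
CREATE INDEX DepDelay_idx   ON ontime ("DepDelay");
ANALYZE VERBOSE ontime;  -- VERBOSE prints progress while gathering stats
```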

> I understand that the problem here is the memory used by the sort
> method. *But*, I already changed the work_mem parameter to 6gigs:)

If you look at your EXPLAIN output you'll see that you don't need that much memory.
You have 8GB of total RAM; if you use that much for sorting you'll start to swap.

> which means that of course work_mem == sort_mem, as
> such, shouldn't be the case that the sort algorithm should have used
> much more memory?

sort_mem is part of work_mem in recent versions of PostgreSQL.
No, the sort algorithm doesn't need that much at all.

> I also attach the output of 'show all;' so you can advice me in any
> other configuration settings that I might need to change to perform
> better.

Let's take a look.

>  geqo                            | on

If in doubt, turn this off. GEQO is capable of producing bad execution plans for you.
But this is not the most important thing to change.

>  shared_buffers                  | 32MB

Here it is. The default setting for shared_buffers doesn't leave space for the buffer cache.
I would recommend increasing this to at least 40% of your total RAM.
Don't forget to restart PostgreSQL after changing it, and it's possible that you'll need to increase some System V
parameters in Fedora. Read the PostgreSQL documentation about it.

>  effective_cache_size            | 8MB

Increase this to at least shared_buffers plus any caches that you have in your hardware (e.g. in the RAID controller).

>  wal_buffers                     | 64kB

Increasing this can make your bulk load faster; 8MB is a good number.
It will not make your SELECT queries faster.

>  wal_sync_method                 | fdatasync

fdatasync is the recommended method for Solaris. I would use open_sync. It's not important for SELECTs either.

>  work_mem                        | 6GB

Start here with 10MB. If you see temp files when executing your SELECTs, try increasing it a bit.
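
Putting these suggestions together, a starting point in postgresql.conf for an 8GB machine could look something like this. The exact numbers are illustrative, not measured; shared_buffers changes require a restart and possibly raising the kernel's SysV shared memory limits (kernel.shmmax and friends) on Fedora:

```
# postgresql.conf sketch for an 8GB box -- illustrative starting values
shared_buffers = 3GB          # was 32MB; ~40% of RAM, needs restart
effective_cache_size = 6GB    # was 8MB; planner hint, allocates no memory
wal_buffers = 8MB             # was 64kB; helps bulk loads, not SELECTs
work_mem = 10MB               # was 6GB; raise only if temp files show up
```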

Let us know what happens in your new tests.
Best regards

Flavio Henrique A. Gurgel
tel. 55-11-2125.4765
fax. 55-11-2125.4777
www.4linux.com.br
FREE SOFTWARE SOLUTIONS

Re: Air-traffic benchmark

From: "A. Kretschmer"
Date:
In response to Lefteris :
>
> airtraffic=# EXPLAIN ANALYZE SELECT "DayOfWeek", count(*) AS c FROM
> ontime WHERE "Year" BETWEEN 2000 AND 2008 GROUP BY "DayOfWeek" ORDER
> BY c DESC;
>                                           QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=7407754.12..7407754.13 rows=4 width=2) (actual time=371188.821..371188.823 rows=7 loops=1)
>    Sort Key: (count(*))
>    Sort Method:  quicksort  Memory: 25kB
>    ->  HashAggregate  (cost=7407754.04..7407754.08 rows=4 width=2) (actual time=371163.316..371163.320 rows=7 loops=1)
>          ->  Seq Scan on ontime  (cost=0.00..7143875.40 rows=52775727 width=2) (actual time=190938.959..346180.079 rows=52484047 loops=1)
>                Filter: (("Year" >= 2000) AND ("Year" <= 2008))
>  Total runtime: 371201.156 ms
> (7 rows)
>
>
> I understand that the problem here is the memory used by the sort
> method. *But*, I already changed the work_mem parameter to 6gigs:)

No.

The problem here is the seq scan. It returns about 52,000,000 rows,
roughly half the table, and it needs 346 seconds.

The sort needs only 25 kB and a negligible fraction of the total runtime.


Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99

Re: Air-traffic benchmark

From
Lefteris
Date:
Thank you all for your answers!

Andrea, I see it the other way around from what you are saying:

Sort  (cost=7407754.12..7407754.13 rows=4 width=2) (actual
time=371188.821..371188.823 rows=7 loops=1)
Seq Scan on ontime  (cost=0.00..7143875.40 rows=52775727 width=2)
(actual time=190938.959..346180.079 rows=52484047 loops=1)


I don't see the seq scan as a problem, and it is the correct choice
here because Year spans from 1999 to 2009 and the query asks for 2000
onward, so PG correctly decides to use a seq scan and not index access.

lefteris

On Thu, Jan 7, 2010 at 2:32 PM, A. Kretschmer
<andreas.kretschmer@schollglas.com> wrote:
> In response to Lefteris :
>>
>> airtraffic=# EXPLAIN ANALYZE SELECT "DayOfWeek", count(*) AS c FROM
>> ontime WHERE "Year" BETWEEN 2000 AND 2008 GROUP BY "DayOfWeek" ORDER
>> BY c DESC;
>>                                                                QUERY
>> PLAN
>>
------------------------------------------------------------------------------------------------------------------------------------------
>>  Sort  (cost=7407754.12..7407754.13 rows=4 width=2) (actual
>> time=371188.821..371188.823 rows=7 loops=1)
>>   Sort Key: (count(*))
>>   Sort Method:  quicksort  Memory: 25kB
>>   ->  HashAggregate  (cost=7407754.04..7407754.08 rows=4 width=2)
>> (actual time=371163.316..371163.320 rows=7 loops=1)
>>         ->  Seq Scan on ontime  (cost=0.00..7143875.40 rows=52775727
>> width=2) (actual time=190938.959..346180.079 rows=52484047 loops=1)
>>               Filter: (("Year" >= 2000) AND ("Year" <= 2008))
>>  Total runtime: 371201.156 ms
>> (7 rows)
>>
>>
>> I understand that the problem here is the memory used by the sort
>> method. *But*, I already changed the work_mem parameter to 6gigs:)
>
> No.
>
> The problem here is the seq scan. It returns about 52,000,000 rows,
> roughly half the table, and it needs 346 seconds.
>
> The sort needs only 25 kB and a negligible fraction of the total runtime.
>
>
> Andreas
> --
> Andreas Kretschmer
> Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
> GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

Re: Air-traffic benchmark

From
Arjen van der Meijden
Date:
On 7-1-2010 13:38 Lefteris wrote:
> I decided to run the benchmark over postgres to get some more
> experience and insights. Unfortunately, the query times I got from
> postgres were not the expected ones:

Why were they not expected? In the given scenario, column databases
have a huge advantage. The given simple example in particular is exactly the type
of query at which a column database *should* excel.
You should, at the very least, compare the queries to MyISAM:
http://www.mysqlperformanceblog.com/2009/11/05/air-traffic-queries-in-myisam-and-tokutek-tokudb/

But unfortunately, that one also beats your PostgreSQL results.

> The hardware characteristics are:
> Platform Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz with 8GB RAM and
> ample disk space (2x 500 GB SATA disk @ 7200 RPM as SW-RAID-0)

Unfortunately, the blog post fails to mention the disk subsystem, so it
may well be much faster than yours. Still, it's not a new, big or fast
server, so unless it has external storage, it shouldn't be too different
for sequential scans.

> SELECT "DayOfWeek", count(*) AS c FROM ontime WHERE "Year" BETWEEN
> 2000 AND 2008 GROUP BY "DayOfWeek" ORDER BY c DESC;
>
> Reported query times are (in sec):
> MonetDB 7.9s
> InfoBright 12.13s
> LucidDB 54.8s
>
> For pg-8.4.2  I got with 3 consecutive runs on the server:
> 5m52.384s
> 5m55.885s
> 5m54.309s

Maybe an index on (Year, DayOfWeek) will help for this query.
But it'll have to scan about half the table anyway, so a seq scan isn't
a bad idea.
In this case, a partitioned table with one partition per year and
constraint exclusion enabled would help a bit more.
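On 8.4 that means inheritance-based partitioning; a rough sketch (untested, using the benchmark's quoted column names):

```sql
-- Inheritance partitioning sketch for PostgreSQL 8.4 (untested).
-- One child table per year, with a CHECK constraint the planner can prove:
CREATE TABLE ontime_2007 (CHECK ("Year" = 2007)) INHERITS (ontime);
CREATE TABLE ontime_2008 (CHECK ("Year" = 2008)) INHERITS (ontime);
-- ... one child per year, plus a trigger or rules to route INSERTs ...

SET constraint_exclusion = on;

-- With the data moved into the children, a query like
--   SELECT ... FROM ontime WHERE "Year" BETWEEN 2000 AND 2008 ...
-- only scans the children whose CHECK constraint overlaps the range.
```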

Best regards,

Arjen

Re: Air-traffic benchmark

From
Lefteris
Date:
Hi Arjen,

so I understand from all of you that you don't consider the use of 25 kB
for sorting to be the cause of the slowdown? Probably I am missing
something about the specific sort algorithm used by PG. My RAM does fill
up, mainly with file buffers from Linux, but the postgres process remains at
0.1% of main memory. Is there no way to force the sort to use,
say, blocks of 128MB? Wouldn't that make a difference?

lefteris

p.s. I already started the ANALYZE VERBOSE again as Flavio suggested
and reset the parameters, although I think some of Flavio's
suggestions have to do with multiple users/queries rather than one
long-running query, like shared_buffers, or not?

On Thu, Jan 7, 2010 at 2:36 PM, Arjen van der Meijden
<acmmailing@tweakers.net> wrote:
> On 7-1-2010 13:38 Lefteris wrote:
>>
>> I decided to run the benchmark over postgres to get some more
>> experience and insights. Unfortunately, the query times I got from
>> postgres were not the expected ones:
>
> Why were they not expected? In the given scenario, column databases
> have a huge advantage. The given simple example in particular is exactly the type of
> query at which a column database *should* excel.
> You should, at the very least, compare the queries to MyISAM:
> http://www.mysqlperformanceblog.com/2009/11/05/air-traffic-queries-in-myisam-and-tokutek-tokudb/
>
> But unfortunately, that one also beats your PostgreSQL results.
>
>> The hardware characteristics are:
>> Platform Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz with 8GB RAM and
>> ample disk space (2x 500 GB SATA disk @ 7200 RPM as SW-RAID-0)
>
> Unfortunately, the blog post fails to mention the disk subsystem, so it may
> well be much faster than yours. Still, it's not a new, big or fast server,
> so unless it has external storage, it shouldn't be too different for
> sequential scans.
>
>> SELECT "DayOfWeek", count(*) AS c FROM ontime WHERE "Year" BETWEEN
>> 2000 AND 2008 GROUP BY "DayOfWeek" ORDER BY c DESC;
>>
>> Reported query times are (in sec):
>> MonetDB 7.9s
>> InfoBright 12.13s
>> LucidDB 54.8s
>>
>> For pg-8.4.2  I got with 3 consecutive runs on the server:
>> 5m52.384s
>> 5m55.885s
>> 5m54.309s
>
> Maybe an index on (Year, DayOfWeek) will help for this query. But
> it'll have to scan about half the table anyway, so a seq scan isn't a bad
> idea.
> In this case, a partitioned table with one partition per year and constraint
> exclusion enabled would help a bit more.
>
> Best regards,
>
> Arjen
>

Re: Air-traffic benchmark

From
"A. Kretschmer"
Date:
In response to Lefteris :
> Thank you all for your answers!
>
> Andrea, I see it the other way around from what you are saying:
>
> Sort  (cost=7407754.12..7407754.13 rows=4 width=2) (actual
> time=371188.821..371188.823 rows=7 loops=1)
> Seq Scan on ontime  (cost=0.00..7143875.40 rows=52775727 width=2)
> (actual time=190938.959..346180.079 rows=52484047 loops=1)
>
>
> I don't see the seq scan as a problem, and it is the correct choice
> here because Year spans from 1999 to 2009 and the query asks for 2000
> onward, so PG correctly decides to use a seq scan and not index access.

That's right.

But there is no contradiction: this seq scan *is* the real problem, not
the sort. And yes, as others said, increasing work_mem isn't the
solution. It is counterproductive, because you lose buffer cache.


Andreas, note the 's' at the end ;-)
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99

Re: Air-traffic benchmark

From
"A. Kretschmer"
Date:
In response to Lefteris :
> Hi Arjen,
>
> so I understand from all of you that you don't consider the use of 25 kB
> for sorting to be the cause of the slowdown? Probably I am missing
> something about the specific sort algorithm used by PG. My RAM does fill
> up, mainly with file buffers from Linux, but the postgres process remains at
> 0.1% of main memory. Is there no way to force the sort to use,
> say, blocks of 128MB? Wouldn't that make a difference?

The result-table fits in that piece of memory.

You have only this little table:

[ 5, 7509643 ]
[ 1, 7478969 ]
[ 4, 7453687 ]
[ 3, 7412939 ]
[ 2, 7370368 ]
[ 7, 7095198 ]
[ 6, 6425690 ]


Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99

Re: Air-traffic benchmark

From
Lefteris
Date:
Yes, I was reading the plan wrong! I thought that each row of the
plan reported the total time for that operation, but it actually reports
the start and end points.

So we all agree that the problem is on the scans:)

So the next question is: why would changing shared memory buffers fix
that? I only have one session with one connection; do I have many
reader workers or something?

Thank you, and sorry for the plethora of questions, but I know little
about the inner workings of postgres:)

lefteris

On Thu, Jan 7, 2010 at 3:05 PM, Jochen Erwied
<jochen@pgsql-performance.erwied.eu> wrote:
> Thursday, January 7, 2010, 2:47:36 PM you wrote:
>
>> so I understand from all of you that you don't consider the use of 25k
>> for sorting to be the cause of the slowdown? Probably I am missing
>
> Maybe you are reading the plan wrong:
>
> - the sort needs only 25kB of memory, and finishes in sub-second time,
>  mainly because the sort only sorts the already summarized data, and not
>  the whole table
> - the sequential scan takes 346 seconds, and thus is the major factor in
>  time to finish!
>
> So the total query time is 371 seconds, of which 346 are required to
> completely scan the table once.
>
> --
> Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
> Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
> D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164
>
>

Re: Air-traffic benchmark

From
Alvaro Herrera
Date:
Lefteris escribió:
> Yes, I was reading the plan wrong! I thought that each row of the
> plan reported the total time for that operation, but it actually reports
> the start and end points.
>
> So we all agree that the problem is on the scans:)
>
> So the next question is: why would changing shared memory buffers fix
> that? I only have one session with one connection; do I have many
> reader workers or something?

No amount of tinkering is going to change the fact that a seqscan is the
fastest way to execute these queries.  Even if you got it to be all in
memory, it would still be much slower than the other systems which, I
gather, are using columnar storage and thus are perfectly suited to this
problem (unlike Postgres).  The talk about "compression ratios" caught
me by surprise until I realized it was columnar stuff.  There's no way
you can get such high ratios on a regular, row-oriented storage.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Air-traffic benchmark

From
Alvaro Herrera
Date:
Alvaro Herrera escribió:

> No amount of tinkering is going to change the fact that a seqscan is the
> fastest way to execute these queries.  Even if you got it to be all in
> memory, it would still be much slower than the other systems which, I
> gather, are using columnar storage and thus are perfectly suited to this
> problem (unlike Postgres).  The talk about "compression ratios" caught
> me by surprise until I realized it was columnar stuff.  There's no way
> you can get such high ratios on a regular, row-oriented storage.

FWIW if you want a fair comparison, get InnoDB numbers.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Air-traffic benchmark

From
Lefteris
Date:
On Thu, Jan 7, 2010 at 3:14 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Lefteris escribió:
>> Yes, I was reading the plan wrong! I thought that each row of the
>> plan reported the total time for that operation, but it actually reports
>> the start and end points.
>>
>> So we all agree that the problem is on the scans:)
>>
>> So the next question is: why would changing shared memory buffers fix
>> that? I only have one session with one connection; do I have many
>> reader workers or something?
>
> No amount of tinkering is going to change the fact that a seqscan is the
> fastest way to execute these queries.  Even if you got it to be all in
> memory, it would still be much slower than the other systems which, I
> gather, are using columnar storage and thus are perfectly suited to this
> problem (unlike Postgres).  The talk about "compression ratios" caught
> me by surprise until I realized it was columnar stuff.  There's no way
> you can get such high ratios on a regular, row-oriented storage.
>
> --
> Alvaro Herrera                                http://www.CommandPrompt.com/
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>

I am aware of that and I totally agree. I would not expect a row
store to match the performance of a column store. I was just trying to
double-check that all settings are correct, because usually the gap
between column and row stores is seconds versus minutes, not seconds
versus almost an hour (for queries Q2-Q8).

I think what you all said was very helpful and clear! The only part
that I still disagree with/don't understand is the shared_buffers option:))

Lefteris

Re: Air-traffic benchmark

From
Jochen Erwied
Date:
Thursday, January 7, 2010, 2:47:36 PM you wrote:

> so I understand from all of you that you don't consider the use of 25k
> for sorting to be the cause of the slowdown? Probably I am missing

Maybe you are reading the plan wrong:

- the sort needs only 25kB of memory, and finishes in sub-second time,
  mainly because the sort only sorts the already summarized data, and not
  the whole table
- the sequential scan takes 346 seconds, and thus is the major factor in
  time to finish!

So the total query time is 371 seconds, of which 346 are required to
completely scan the table once.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164


Re: Air-traffic benchmark

From
Ivan Voras
Date:
On 7.1.2010 15:23, Lefteris wrote:

> I think what you all said was very helpful and clear! The only part
> that I still disagree/don't understand is the shared_buffer option:))

Did you ever try increasing shared_buffers to what was suggested (around
4 GB) and see what happens (I didn't see it in your posts)?

Shared_buffers can be thought of as PostgreSQL's internal cache. If the
pages being scanned for a particular query are in the cache, this will
help performance very much on multiple executions of the same query.
OTOH, since the file system's cache didn't help you significantly, there
is little chance that shared_buffers will. It is still worth trying.

Re: Air-traffic benchmark

From
Lefteris
Date:
On Thu, Jan 7, 2010 at 3:51 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 7.1.2010 15:23, Lefteris wrote:
>
>> I think what you all said was very helpful and clear! The only part
>> that I still disagree/don't understand is the shared_buffer option:))
>
> Did you ever try increasing shared_buffers to what was suggested (around
> 4 GB) and see what happens (I didn't see it in your posts)?

No, I did not do that yet, mainly because I need the admin of the
machine to change the shmmax of the kernel, and also because I have no
multiple queries running. Does a seq scan use shared_buffers?

>
> Shared_buffers can be thought of as PostgreSQL's internal cache. If the
> pages being scanned for a particular query are in the cache, this will
> help performance very much on multiple executions of the same query.
> OTOH, since the file system's cache didn't help you significantly, there
> is little chance that shared_buffers will. It is still worth trying.
>
> From the description of the data ("...from years 1988 to 2009...") it
> looks like the query for "between 2000 and 2009" pulls out about half of
> the data. If an index could be used instead of seqscan, it could be
> perhaps only 50% faster, which is still not very comparable to others.
>
> The table is very wide, which is probably why the tested databases can
> deal with it faster than PG. You could try and narrow the table down
> (for instance: remove the Div* fields) to make the data more
> "relational-like". In real life, speedups in these circumstances would
> probably be gained by normalizing the data to make the basic table
> smaller and easier to use with indexing.
>
>
>

Re: Air-traffic benchmark

From
Grzegorz Jaśkiewicz
Date:
On Thu, Jan 7, 2010 at 3:05 PM, Lefteris <lsidir@gmail.com> wrote:
> On Thu, Jan 7, 2010 at 3:51 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 7.1.2010 15:23, Lefteris wrote:
>>
>>> I think what you all said was very helpful and clear! The only part
>>> that I still disagree/don't understand is the shared_buffer option:))
>>
>> Did you ever try increasing shared_buffers to what was suggested (around
>> 4 GB) and see what happens (I didn't see it in your posts)?
>
> No, I did not do that yet, mainly because I need the admin of the
> machine to change the shmmax of the kernel, and also because I have no
> multiple queries running. Does a seq scan use shared_buffers?

Think of the shared buffers as a cache. They will help subsequent
queries avoid having to go to disk.


--
GJ

Re: Air-traffic benchmark

From
"Gurgel, Flavio"
Date:
----- "Lefteris" <lsidir@gmail.com> escreveu:
> > Did you ever try increasing shared_buffers to what was suggested (around
> > 4 GB) and see what happens (I didn't see it in your posts)?
>
> No, I did not do that yet, mainly because I need the admin of the
> machine to change the shmmax of the kernel, and also because I have no
> multiple queries running. Does a seq scan use shared_buffers?

Having multiple queries running is *not* the only reason you need lots of shared_buffers.
Think of shared_buffers as a page cache; data in PostgreSQL is organized in pages.
If a single query execution has a step that brings a page into the buffer cache, that is enough to speed up
another step and change the execution plan, since data access in memory is (usually) faster than disk.

> > help performance very much on multiple executions of the same query.

This is also true.
This kind of test should, and will, give different results in subsequent executions.

> > From the description of the data ("...from years 1988 to 2009...") it
> > looks like the query for "between 2000 and 2009" pulls out about half of
> > the data. If an index could be used instead of seqscan, it could be
> > perhaps only 50% faster, which is still not very comparable to others.

The use of the index over a seqscan has to be tested. I don't agree with the 50% estimate, since simple integers
stored in a B-tree have a good chance of being retrieved in the required order, and the discarded data will be
discarded quickly too, so the gain has to be measured.

I bet that an index scan will be a lot faster, but it's just a bet :)

> > The table is very wide, which is probably why the tested databases can
> > deal with it faster than PG. You could try and narrow the table down
> > (for instance: remove the Div* fields) to make the data more
> > "relational-like". In real life, speedups in these circumstances would
> > probably be gained by normalizing the data to make the basic table
> > smaller and easier to use with indexing.

Ugh. I don't think so. That's why indexes were invented. PostgreSQL is smart enough to "jump" over columns
using byte offsets.
A better option for this table is to partition it in year (or year/month) chunks.

45GB is not such a huge table compared to others I have seen. I have systems where each partition is 10
or 20GB and data is very fast to access, even with aggregation queries.

Flavio Henrique A. Gurgel
tel. 55-11-2125.4765
fax. 55-11-2125.4777
www.4linux.com.br
FREE SOFTWARE SOLUTIONS

Re: Air-traffic benchmark

From
Ivan Voras
Date:
2010/1/7 Lefteris <lsidir@gmail.com>:
> On Thu, Jan 7, 2010 at 3:51 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 7.1.2010 15:23, Lefteris wrote:
>>
>>> I think what you all said was very helpful and clear! The only part
>>> that I still disagree/don't understand is the shared_buffer option:))
>>
>> Did you ever try increasing shared_buffers to what was suggested (around
>> 4 GB) and see what happens (I didn't see it in your posts)?
>
> No I did not to that yet, mainly because I need the admin of the
> machine to change the shmmax of the kernel and also because I have no
> multiple queries running. Does Seq scan uses shared_buffers?

Everything uses shared_buffers, even things that do not benefit from
it. This is because shared_buffers is the part of the general database
IO - it's unavoidable.

Re: Air-traffic benchmark

From
Lefteris
Date:
On Thu, Jan 7, 2010 at 4:57 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> 2010/1/7 Lefteris <lsidir@gmail.com>:
>> On Thu, Jan 7, 2010 at 3:51 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>>> On 7.1.2010 15:23, Lefteris wrote:
>>>
>>>> I think what you all said was very helpful and clear! The only part
>>>> that I still disagree/don't understand is the shared_buffer option:))
>>>
>>> Did you ever try increasing shared_buffers to what was suggested (around
>>> 4 GB) and see what happens (I didn't see it in your posts)?
>>
>> No I did not to that yet, mainly because I need the admin of the
>> machine to change the shmmax of the kernel and also because I have no
>> multiple queries running. Does Seq scan uses shared_buffers?
>
> Everything uses shared_buffers, even things that do not benefit from
> it. This is because shared_buffers is the part of the general database
> IO - it's unavoidable.
>


I will increase shared_buffers once my kernel is configured, and I
will report back to you.

As for the index scan, I already built a B-tree on (Year, Month), but PG
(correctly) decides not to use it. The data run from year 1999 up to
2009 (a typo on my part earlier), so almost 90% of the data is accessed.
When I ask for a specific year, like 2004, the index is used and
query times become faster.

lefteris

Re: Air-traffic benchmark

From
Matthew Wakeling
Date:
On Thu, 7 Jan 2010, Gurgel, Flavio wrote:
> If one single query execution had a step that brought a page to the
> buffercache, it's enough to increase another step speed and change the
> execution plan, since the data access in memory is (usually) faster then
> disk.

Postgres does not change a query plan according to the shared_buffers
setting. It does not anticipate one step contributing to another step in
this way. It does however make use of the effective_cache_size setting to
estimate this effect, and that does affect the planner.
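That effect is easy to check per session, since effective_cache_size can be changed without a restart (sketch, untested):

```sql
-- The planner's cost model reads effective_cache_size, not shared_buffers,
-- when costing index scans. Compare the plans across two settings:
SET effective_cache_size = '128MB';
EXPLAIN SELECT "DayOfWeek", count(*) FROM ontime
 WHERE "Year" BETWEEN 2000 AND 2008 GROUP BY "DayOfWeek";

SET effective_cache_size = '6GB';
EXPLAIN SELECT "DayOfWeek", count(*) FROM ontime
 WHERE "Year" BETWEEN 2000 AND 2008 GROUP BY "DayOfWeek";
-- Only the cost estimates (and possibly the chosen plan) change;
-- no data is cached or moved by the SET itself.
```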

> The use of the index over seqscan has to be tested. I don't agree in 50%
> gain, since simple integers stored on B-Tree have a huge possibility of
> beeing retrieved in the required order, and the discarded data will be
> discarder quickly too, so the gain has to be measured.
>
> I bet that an index scan will be a lot faster, but it's just a bet :)

In a situation like this, the opposite will be true. If you were accessing
a very small part of a table, say to order by a field with a small limit,
then an index can be very useful by providing the results in the correct
order. However, in this case, almost the entire table has to be read.
Changing the order in which it is read will mean that the disc access is
no longer sequential, which will slow things down, not speed them up.
The Postgres planner isn't stupid (mostly); there is probably a good
reason why it isn't using an index scan.

>> The table is very wide, which is probably why the tested databases can
>> deal with it faster than PG. You could try and narrow the table down
>> (for instance: remove the Div* fields) to make the data more
>> "relational-like". In real life, speedups in this circumstances would
>> probably be gained by normalizing the data to make the basic table
>> smaller and easier to use with indexing.
>
> Ugh. I don't think so. That's why indexes were invented. PostgreSQL is
> smart enough to "jump" over columns using byte offsets.
> A better option for this table is to partition it in year (or year/month) chunks.

Postgres (mostly) stores the columns for a row together with a row, so
what you say is completely wrong. Postgres does not "jump" over columns
using byte offsets in this way. The index references a row in a page on
disc, and that page is fetched separately in order to retrieve the row.
The expensive part is physically moving the disc head to the right part of
the disc in order to fetch the correct page from the disc - jumping over
columns will not help with that at all.

Reducing the width of the table will greatly improve the performance of a
sequential scan, as it will reduce the size of the table on disc, and
therefore the time taken to read the entire table sequentially.

Moreover, your suggestion of partitioning the table may not help much with
this query. It will turn a single sequential scan into a UNION of many
tables, which may be harder for the planner to plan. Also, for queries
that access small parts of the table, indexes will help more than
partitioning will.

Partitioning will help most in the case where you want to summarise a
single year's data. Not really otherwise.

Matthew

--
 Q: What's the difference between ignorance and apathy?
 A: I don't know, and I don't care.

Re: Air-traffic benchmark

From
Craig James
Date:
Alvaro Herrera wrote:
> No amount of tinkering is going to change the fact that a seqscan is the
> fastest way to execute these queries.  Even if you got it to be all in
> memory, it would still be much slower than the other systems which, I
> gather, are using columnar storage and thus are perfectly suited to this
> problem (unlike Postgres).  The talk about "compression ratios" caught
> me by surprise until I realized it was columnar stuff.  There's no way
> you can get such high ratios on a regular, row-oriented storage.

One of the "good tricks" with Postgres is to convert a very wide table into a set of narrow tables, then use a view
to create something that looks like the original table. It requires you to modify the write portions of your app,
but the read portions can stay the same.

A seq scan on one column will be *much* faster when you rearrange your database this way, since it's only scanning
relevant data. You pay the price of an extra join on primary keys, though.

If you have just a few columns in a very wide table that are seq-scanned a lot, you can pull out just those columns
and leave the rest in the wide table.

The same trick is also useful if you have one or a few columns that are updated frequently: pull them out, and use a
view to recreate the original appearance. It saves a lot on garbage collection.

Craig


Re: Air-traffic benchmark

From
"Gurgel, Flavio"
Date:
----- "Matthew Wakeling" <matthew@flymine.org> escreveu:
> On Thu, 7 Jan 2010, Gurgel, Flavio wrote:
> Postgres does not change a query plan according to the shared_buffers
> setting. It does not anticipate one step contributing to another step in
> this way. It does however make use of the effective_cache_size setting to
> estimate this effect, and that does affect the planner.

That was what I was trying to say :)

> In a situation like this, the opposite will be true. If you were accessing
> a very small part of a table, say to order by a field with a small limit,
> then an index can be very useful by providing the results in the correct
> order. However, in this case, almost the entire table has to be read.
> Changing the order in which it is read will mean that the disc access is
> no longer sequential, which will slow things down, not speed them up.
> The Postgres planner isn't stupid (mostly), there is probably a good
> reason why it isn't using an index scan.

Sorry but I disagree. This is the typical case where the test has to be made.
The results are partial, let's say 50% of the table. Considering that the disk is fast enough, the cost estimation
of sequential and random reads is in a proportion of 1 to 4, considering default settings in PostgreSQL.

If, and only if, the data is time-distributed in the table (which can be the case after a bulk load), there will be
some gain in the seqscan.
If, let's say, 50% of the 50% (25% of the data) is time-distributed (which can be the case in most data warehouses),
the cost of random reads * number of reads can be cheaper than the seqscan.

The volume of data doesn't make the index very deep, let's say 2^7 or 2^8. This can lead to very fast data
retrieval.

> > Ugh. I don't think so. That's why indexes were invented. PostgreSQL is
> > smart enough to "jump" over columns using byte offsets.
> > A better option for this table is to partition it in year (or year/month) chunks.
>
> Postgres (mostly) stores the columns for a row together with a row, so
> what you say is completely wrong. Postgres does not "jump" over columns
> using byte offsets in this way. The index references a row in a page on
> disc, and that page is fetched separately in order to retrieve the row.
> The expensive part is physically moving the disc head to the right part of
> the disc in order to fetch the correct page from the disc - jumping over
> columns will not help with that at all.

If the index points to the right pages, I stand by the 1-to-4 cost proportion.
Agreed about seqscans. When I talked about byte offsets I was talking about data in the same disk page, and that
does not help I/O reduction at all.

> Reducing the width of the table will greatly improve the performance of a
> sequential scan, as it will reduce the size of the table on disc, and
> therefore the time taken to read the entire table sequentially.

I just don't understand whether you're seeing this situation as OLTP or DW, sorry.
DW tables are usually wider than OLTP ones.

> Moreover, your suggestion of partitioning the table may not help much with
> this query. It will turn a single sequential scan into a UNION of many
> tables, which may be harder for the planner to plan. Also, for queries

Partitioned plans are a collection of an independent plan for each table in the inheritance.
If the data to be retrieved is confined in selected partitions, you won't seqscan the partitions you don't need.
The cost of the "union" of the aggregations in memory is a lot cheaper then the avoided seqscans.

I have at least 3 cases of partitioning in queries exactly like this where execution times dropped from 3min to 5s.
All of them DW tables, with aggregation and huge seqscans.
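A minimal sketch of the inheritance-based partitioning being discussed here (the PostgreSQL 8.4-era approach), using the benchmark's "Year" column; the child tables shown are illustrative assumptions, not the poster's actual setup:

```sql
-- Parent stays empty; each child carries a CHECK constraint that lets the
-- planner skip it entirely (requires constraint_exclusion = partition,
-- the 8.4 default).
CREATE TABLE ontime_2000 (CHECK ("Year" = 2000)) INHERITS (ontime);
CREATE TABLE ontime_2001 (CHECK ("Year" = 2001)) INHERITS (ontime);
-- ... one child per year through 2008 ...

-- A query constrained on "Year" then seqscans only the matching children,
-- and the "union" of the per-partition aggregates is cheap in memory:
SELECT "DayOfWeek", count(*) AS c
FROM ontime
WHERE "Year" BETWEEN 2000 AND 2008
GROUP BY "DayOfWeek"
ORDER BY c DESC;
```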

I stand by my point that the right use of indexes here has to be tested.

Flavio Henrique A. Gurgel
tel. 55-11-2125.4786
cel. 55-11-8389.7635
www.4linux.com.br
FREE SOFTWARE SOLUTIONS

Re: Air-traffic benchmark

From
Scott Marlowe
Date:
On Thu, Jan 7, 2010 at 10:57 AM, Gurgel, Flavio <flavio@4linux.com.br> wrote:
> ----- "Matthew Wakeling" <matthew@flymine.org> escreveu:
>> On Thu, 7 Jan 2010, Gurgel, Flavio wrote:
>> Postgres does not change a query plan according to the shared_buffers
>>
>> setting. It does not anticipate one step contributing to another step
>> in
>> this way. It does however make use of the effective_cache_size setting
>> to
>> estimate this effect, and that does affect the planner.
>
> That was what I was trying to say :)
>
>> In a situation like this, the opposite will be true. If you were
>> accessing
>> a very small part of a table, say to order by a field with a small
>> limit,
>> then an index can be very useful by providing the results in the
>> correct
>> order. However, in this case, almost the entire table has to be read.
>>
>> Changing the order in which it is read will mean that the disc access
>> is
>> no longer sequential, which will slow things down, not speed them up.
>>
>> The Postgres planner isn't stupid (mostly), there is probably a good
>> reason why it isn't using an index scan.
>
> Sorry but I disagree. This is the typical case where the test has to be made.
> The results are partial, let's say 50% of the table. Considering that the disk is fast enough, the cost estimation of
> sequential and random reads is in a proportion of 1 to 4, considering default settings in PostgreSQL.

You do know that indexes in postgresql are not "covering" right?  I.e.
after hitting the index, the db then has to hit the table to see if
those rows are in fact visible.  So there's no such thing in pgsql, at
the moment, as an index only scan.

Re: Air-traffic benchmark

From
"Gurgel, Flavio"
Date:
----- "Scott Marlowe" <scott.marlowe@gmail.com> escreveu:

> You do know that indexes in postgresql are not "covering" right?
> I.e.
> after hitting the index, the db then has to hit the table to see if
> those rows are in fact visible.  So there's no such thing in pgsql,
> at
> the moment, as an index only scan.

That was just an estimation of the effort to reach a tuple through seqscan X indexscan. In both cases the tuple has to be
checked, sure.

Flavio Henrique A. Gurgel
tel. 55-11-2125.4786
cel. 55-11-8389.7635
www.4linux.com.br
FREE SOFTWARE SOLUTIONS

Re: Air-traffic benchmark

From
Nikolas Everett
Date:
This table is totally unnormalized.  Normalize it and try again.  You'll probably see a huge speedup.  Maybe even 10x.  My mantra has always been less data stored means less data to scan means faster scans.
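As a hedged illustration of what such a normalization step could look like (the carrier column is just one plausible candidate from the wide ontime schema; the names here are assumptions):

```sql
-- Move a repeated descriptive column into a small lookup table, leaving
-- only a narrow integer key in the ~120M-row fact table.
CREATE TABLE carrier (
    carrier_id   serial PRIMARY KEY,
    carrier_code varchar(10) UNIQUE NOT NULL
);

ALTER TABLE ontime ADD COLUMN carrier_id integer REFERENCES carrier;
-- After backfilling carrier_id, the wide text column can be dropped;
-- every full seqscan then reads a smaller table.
```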


Re: Air-traffic benchmark

From
Greg Smith
Date:
Lefteris wrote:
> So we all agree that the problem is on the scans:)
>
> So the next question is why changing shared memory buffers will fix
> that? i only have one session with one connection, do I have like many
> reader workers or something?
>

I wouldn't expect it to.  Large sequential scans like this one are
optimized in PostgreSQL to only use up a small portion of the
shared_buffers cache.  Allocating more RAM to the database won't do much
to change the fact that you're spending the whole time waiting for
physical I/O to happen.

What might help is increasing effective_cache_size a lot though, because
there you might discover the database switching to all new sorts of
plans for some of these queries.  But, again, that doesn't impact the
situation where a sequential scan is the only approach.
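For concreteness, a sketch of what that might look like in postgresql.conf on the 8GB machine in question; the numbers are illustrative assumptions, not tuned recommendations:

```
# postgresql.conf (illustrative values for an 8GB box)
shared_buffers = 1GB          # big seqscans only use a small ring buffer of this
effective_cache_size = 6GB    # planner's estimate of OS cache; affects plan
                              # choice, not the speed of an unavoidable seqscan
```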

I have this whole data set on my PC already and have been trying to find
time to get it loaded and start my own tests here, it is a quite
interesting set of information.  Can you tell me what you had to do in
order to get it running in PostgreSQL?  If you made any customizations
there, I'd like to get a copy of them.  Would save me some time and help
me get to where I could give suggestions out if I had a "pgdumpall
--schema-only" dump from your database for example, or however you got
the schema into there, and the set of PostgreSQL-compatible queries
you're using.

By the way:  if anybody else wants to join in, here's a script that
generates a script to download the whole data set:

#!/usr/bin/env python
for y in range(1988,2010):
    for m in range(1,13):
        print "wget --limit-rate=100k http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_%s_%s.zip" % (y,m)

It's 3.8GB of download that uncompresses into 46GB of CSV data, which is
why I put the rate limiter on there--kept it from clogging my entire
Internet connection.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Air-traffic benchmark

From
Lefteris
Date:
Hi Greg,

thank you for your help. The only change I made to the dataset was
removing the last comma from the CSV files, as it was interpreted by pg
as an extra column. The schema I used, the load script and queries can
be found at:

http://homepages.cwi.nl/~lsidir/postgres/

(I understood that if I attach these files here, my email will not
reach the list so I give you a link to download them).

Also, since you are interested in the benchmark, you can also check

http://homepages.cwi.nl/~mk/ontimeReport

for a report of various experiments with MonetDB and comparison with
previously published numbers.

The schema I used for pg is slightly different from that of MonetDB,
since the delay fields could not be parsed by pg as integers
but only as varchar(4). Hence the extra index on the DepDelay field:)

Also at http://homepages.cwi.nl/~lsidir/PostgreSQL-ontimeReport you
can see the detailed times I got from postgres.

I really appreciate your help! this is a great opportunity for me to
get some feeling and insights on postgres since I never had the chance
to use it in a large scale project.

lefteris



Re: Air-traffic benchmark

From
Lefteris
Date:
On Thu, Jan 7, 2010 at 11:57 PM, Lefteris <lsidir@gmail.com> wrote:
> The schema I used for pg is slightly different from that one of
> MonetDB since the delay fields could not be parsed by pg as integers
> but only as varchar(4). Hence the extra index on DepDelay field:)

Sorry, I meant the ArrTime and DepTime fields were changed, because they
appear in the data as HHMM, but these fields were not used in the
queries. The index on DepDelay was made for q3, q4, q5 and q7.


Re: Air-traffic benchmark

From
hubert depesz lubaczewski
Date:
On Thu, Jan 07, 2010 at 01:38:41PM +0100, Lefteris wrote:
> airtraffic=# EXPLAIN ANALYZE SELECT "DayOfWeek", count(*) AS c FROM
> ontime WHERE "Year" BETWEEN 2000 AND 2008 GROUP BY "DayOfWeek" ORDER
> BY c DESC;

Well, this query basically has to be slow. The correct approach to this
problem is to add precalculated aggregates, either with triggers or
with some cronjob.
Afterwards, query speed depends only on how good and how detailed your
aggregates are.
Of course calculating them is not free, but it is done on write (or
periodically), not on user request, which makes user requests *much*
faster.
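A minimal sketch of the cronjob flavour of this idea; the summary table name is made up, but the base columns follow the benchmark schema:

```sql
-- Rebuilt periodically (e.g. nightly) instead of computed per user request.
CREATE TABLE ontime_agg AS
SELECT "Year", "DayOfWeek", count(*) AS flights
FROM ontime
GROUP BY "Year", "DayOfWeek";

-- Query 1 then touches a few hundred summary rows instead of ~120M:
SELECT "DayOfWeek", sum(flights) AS c
FROM ontime_agg
WHERE "Year" BETWEEN 2000 AND 2008
GROUP BY "DayOfWeek"
ORDER BY c DESC;
```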

depesz

--
Linkedin: http://www.linkedin.com/in/depesz  /  blog: http://www.depesz.com/
jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007

Re: Air-traffic benchmark

From
Greg Smith
Date:
hubert depesz lubaczewski wrote:
> Well, this query basically has to be slow. Correct approach to this
> problem is to add precalculated aggregates...

The point of this data set and associated queries is to see how fast the
database can do certain types of queries on its own.  Some other types
of database implementations automatically do well here due to column
storage and other techniques; the idea is to measure how well you can do
without rewriting the app at all.  Besides, if precomputed aggregates
were added, they'd improve performance for all the other databases, too.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Air-traffic benchmark

From
Craig Ringer
Date:
On 7/01/2010 11:45 PM, Gurgel, Flavio wrote:

>>> The table is very wide, which is probably why the tested databases
>> can
>>> deal with it faster than PG. You could try and narrow the table
>> down
>>> (for instance: remove the Div* fields) to make the data more
>>> "relational-like". In real life, speedups in this circumstances
>> would
>>> probably be gained by normalizing the data to make the basic table
>>> smaller and easier to use with indexing.
>
> Ugh. I don't think so. That's why indexes were invented. PostgreSQL is smart enough to "jump" over columns using byte
offsets.

Even if Pg tried to do so, it would generally not help. The cost of a
disk seek to the start of the next row would be much greater than the
cost of continuing to sequentially read until that point was reached.

With the amazing sequential read speeds and still mediocre seek speeds
of modern disks, it's rarely worth seeking over unwanted data that's
less than a megabyte or two in size.

Anyway, in practice the OS-level, array-level and/or disk-level
readahead would generally ensure that the data you were trying to skip
had already been read or was in the process of being read.

Can Pg even read partial records? I thought it all operated on a page
level, where if an index indicates that a particular value is present on
a page the whole page gets read in and all records on the page are
checked for the value of interest. No?

--
Craig Ringer

Re: Air-traffic benchmark

From
Tom Lane
Date:
Craig Ringer <craig@postnewspapers.com.au> writes:
> Can Pg even read partial records ? I thought it all operated on a page
> level, where if an index indicates that a particular value is present on
> a page the whole page gets read in and all records on the page are
> checked for the value of interest. No?

The whole page gets read, but we do know which record on the page
the index entry is pointing at.

(This statement is an oversimplification, because of lossy indexes
and lossy bitmap scans, but most of the time it's not a matter of
"checking all records on the page".)

            regards, tom lane

Re: Air-traffic benchmark

From
Craig Ringer
Date:
On 8/01/2010 2:11 AM, Nikolas Everett wrote:
> This table is totally unnormalized.  Normalize it and try again.  You'll
> probably see a huge speedup.  Maybe even 10x.  My mantra has always been
> less data stored means less data to scan means faster scans.

Sometimes one intentionally denormalizes storage, though. JOIN costs can
be considerable too, and if most of the time you're interested in all
the data for a record not just a subset of it, storing it denormalized
is often faster and cheaper than JOINing for it or using subqueries to
fetch it.

Normalization or any other splitting of record into multiple separately
stored records also has costs in complexity, management, the need for
additional indexes, storage of foreign key references, all the extra
tuple headers you need to store, etc.

It's still generally the right thing to do, but it should be thought
about, not just tackled blindly. I only tend to view it as a no-brainer
if the alternative is storing numbered fields ("field0", "field1",
"field2", etc) ... and even then there are exceptions. One of my schema
at the moment has address_line_1 through address_line_4 in a `contact'
entity, and there's absolutely *no* way I'm splitting that into a
separate table of address_lines accessed by join and sort! (Arguably I
should be using a single `text' field with embedded newlines instead,
though).

Sometimes it's even better to hold your nose and embed an array in a
record rather than join to an external table. Purism can be taken too far.

Note that Pg's TOAST mechanism plays a part here, too. If you have a big
`text' field, it's probably going to get stored out-of-line (TOASTed)
anyway, and TOAST is going to be cleverer about fetching it than you
will be using a JOIN. So storing it in-line is likely to be the right
way to go. You can even force out-of-line storage if you're worried.

In the case of this benchmark, even if they split much of this data out
into other tables by reference, it's likely to be slower rather than
faster if they still want the data they've split out for most of their
queries.

--
Craig Ringer

Re: Air-traffic benchmark

From
Simon Riggs
Date:
On Thu, 2010-01-07 at 13:38 +0100, Lefteris wrote:
> Reported query times are (in sec):
> MonetDB 7.9s
> InfoBright 12.13s
> LucidDB 54.8s

It needs to be pointed out that those databases are specifically
optimised for Data Warehousing, whereas Postgres core is optimised for
concurrent write workloads in production systems.

If you want a best-vs-best type of comparison, you should be looking at
a version of Postgres optimised for Data Warehousing. These results show
that Postgres-related options exist that clearly beat the above numbers.
http://community.greenplum.com/showthread.php?t=111
I note also that Greenplum's Single Node Edition is now free to use, so
is a reasonable product for comparison on this list.

Also, I'm unimpressed by a Data Warehouse database that requires
everything to reside in memory, e.g. MonetDB. In my experience that
severely limits real-world usability, because it implies the queries
you're running aren't ad-hoc.

--
 Simon Riggs           www.2ndQuadrant.com