Thread: Are 50 million rows a problem for Postgres?
Hi all, I work in a telco and I have a huge amount of data (50 million rows), but I see a lack of performance on huge tables with Postgres. Are 50 million rows the "limit" of Postgres (with good performance)? I am expecting 2 billion records in 2004, so I have to do something. Does anyone here run a huge database whom I could ask a few questions?

My hardware is good, my indexes are good, so please don't answer with something like "use vacuum" :)

The Joy of Index

Vasilis Ventirozos
--------------------
Could you give more detailed information? What does EXPLAIN say?

On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)? I am expecting 2 billion
> records in 2004, so I have to do something. Does anyone here run a huge
> database whom I could ask a few questions?
>
> My hardware is good, my indexes are good, so please don't answer with
> something like "use vacuum" :)
It's not a standard statement; I am trying to gather statistics for the company and I see a lack of performance (the same statement on Informix runs fine, with the same indexes of course).

Vasilis Ventirozos
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)? I am expecting 2 billion
> records in 2004, so I have to do something.

I have a similarly huge number of records, as I process our web, ftp, and rsync logs together using Postgres. Works like a charm. You do have to allow that queries are going to take a long time.

I use about 6 queries to summarise a quarter's data - each run for each month, so a total of 18 queries. These run in a little over 24 hours. And there are many, many records per month.

--
Sam Barnett-Cormack
Software Developer                           |  Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk)  |  Lancaster University
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)? I am expecting 2 billion
> records in 2004, so I have to do something.

I did some performance testing back on PostgreSQL version 7.2 on a table of 350,000 records. My analysis at the time was that for accessing random records, performance deteriorated the further the records you were accessing were from the beginning of the index. For example, a query with say OFFSET 250000 would cause large delays. On a text index these delays were in the order of 60 seconds, which was unacceptable for the application I was developing.

To overcome this problem I had to do away with queries that used OFFSET and developed queries that accessed the data relative to previously accessed records. The only drawback was that I had to make the indexes unique by appending a unique id column as the last column in the index. I now have no problem scanning a table in excess of 1 million records in small chunks at a time.

Anyway, without seeing what sort of query you are having problems with, nobody on these email lists will be able to fully help you. At a minimum we need to see an SQL statement and the results of EXPLAIN.

Regards
Donald Fraser.
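For illustration, a minimal sketch of the chunked, keyset-style access Donald describes, using the callticket table mentioned later in the thread (the callid key column and the literal 123456 are hypothetical - substitute whatever unique column the real index ends with):

    -- OFFSET gets slower the deeper you go: every skipped row is still
    -- generated and then thrown away.
    SELECT callid, spcode
    FROM callticket
    ORDER BY callid
    LIMIT 100 OFFSET 250000;

    -- Keyset-style equivalent: remember the last callid of the previous
    -- chunk and restart just after it; a b-tree index on callid keeps this
    -- cheap no matter how deep into the table you are.
    SELECT callid, spcode
    FROM callticket
    WHERE callid > 123456        -- last callid seen in the previous chunk
    ORDER BY callid
    LIMIT 100;

Appending a unique id column as the last column of the index, as Donald did, is what makes the ordering unambiguous so chunks neither skip nor repeat rows.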
This is a simple statement that I run:

core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
                                       QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=2057275.91..2130712.22 rows=979151 width=4)
   ->  Group  (cost=2057275.91..2106233.45 rows=9791508 width=4)
         ->  Sort  (cost=2057275.91..2081754.68 rows=9791508 width=4)
               Sort Key: spcode
               ->  Seq Scan on callticket  (cost=0.00..424310.08 rows=9791508 width=4)
(5 rows)
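A natural follow-up here, assuming the server is 7.2 or later, is EXPLAIN ANALYZE: it actually executes the statement and reports per-node times and row counts alongside the planner's estimates, which would show whether the time goes into the seq scan or the sort. A sketch of the invocation only - output is omitted since it depends on the data:

    core_netfon=# EXPLAIN ANALYZE select spcode,count(*) from callticket group by spcode;

Because it really runs the query, on a 9.8-million-row table it will take as long as the query itself.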
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> This is a simple statement that I run:
>
> core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;

Well, yeah. Whatever you do, a complete seqscan and count of a table this size is going to take a long time - in the order of hours rather than days, I would expect. However, you may want to ensure that all tuning in postgresql.conf is correct, as it may not be using all possible resources. That will probably only make a small difference.

--
Sam Barnett-Cormack
Software Developer                           |  Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk)  |  Lancaster University
"Vasilis Ventirozos" <vendi@cosmoline.com> wrote: > This is a simple statement that i run > > core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode; > QUERY PLAN > -------------------------------------------------------------------------- ------------- > Aggregate (cost=2057275.91..2130712.22 rows=979151 width=4) > -> Group (cost=2057275.91..2106233.45 rows=9791508 width=4) > -> Sort (cost=2057275.91..2081754.68 rows=9791508 width=4) > Sort Key: spcode > -> Seq Scan on callticket (cost=0.00..424310.08 rows=9791508 > width=4) > (5 rows) May we see your configuration file ? Regards Gaetano Mendola
I use the default configuration file, with TCP/IP enabled. Any suggestion about the configuration file?
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> I use the default configuration file, with TCP/IP enabled.
> Any suggestion about the configuration file?

Post a copy of it to the list, along with the specs of the machine it is running on, and I'm sure we'll all pipe in.

--
Sam Barnett-Cormack
Software Developer                           |  Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk)  |  Lancaster University
The server is a dual Xeon 2.4 GHz HP with a 15k rpm SCSI disk and 2 GB of RAM.

#
# Connection Parameters
#
tcpip_socket = true
#ssl = false

#max_connections = 32
#superuser_reserved_connections = 2

port = 5432
#hostname_lookup = false
#show_source_port = false

#unix_socket_directory = ''
#unix_socket_group = ''
#unix_socket_permissions = 0777 # octal

#virtual_host = ''

#krb_server_keyfile = ''

#
# Shared Memory Size
#
#shared_buffers = 64            # min max_connections*2 or 16, 8KB each
#max_fsm_relations = 1000       # min 10, fsm is free space map, ~40 bytes
#max_fsm_pages = 10000          # min 1000, fsm is free space map, ~6 bytes
#max_locks_per_transaction = 64 # min 10
#wal_buffers = 8                # min 4, typically 8KB each

#
# Non-shared Memory Sizes
#
#sort_mem = 1024                # min 64, size in KB
#vacuum_mem = 8192              # min 1024, size in KB

#
# Write-ahead log (WAL)
#
#checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300       # range 30-3600, in seconds
#
#commit_delay = 0               # range 0-100000, in microseconds
#commit_siblings = 5            # range 1-1000
#
#fsync = true
#wal_sync_method = fsync        # the default varies across platforms:
                                # fsync, fdatasync, open_sync, or open_datasync
#wal_debug = 0                  # range 0-16

#
# Optimizer Parameters
#
#enable_seqscan = true
#enable_indexscan = true
#enable_tidscan = true
#enable_sort = true
#enable_nestloop = true
#enable_mergejoin = true
#enable_hashjoin = true

#effective_cache_size = 1000    # typically 8KB each
#random_page_cost = 4           # units are one sequential page fetch cost
#cpu_tuple_cost = 0.01          # (same)
#cpu_index_tuple_cost = 0.001   # (same)
#cpu_operator_cost = 0.0025     # (same)

#default_statistics_target = 10 # range 1-1000

#
# GEQO Optimizer Parameters
#
#geqo = true
#geqo_selection_bias = 2.0      # range 1.5-2.0
#geqo_threshold = 11
#geqo_pool_size = 0             # default based on tables in statement,
                                # range 128-1024
#geqo_effort = 1
#geqo_generations = 0
#geqo_random_seed = -1          # auto-compute seed

#
# Message display
#
#server_min_messages = notice   # Values, in order of decreasing detail:
                                #   debug5, debug4, debug3, debug2, debug1,
                                #   info, notice, warning, error, log, fatal,
                                #   panic
#client_min_messages = notice   # Values, in order of decreasing detail:
                                #   debug5, debug4, debug3, debug2, debug1,
                                #   log, info, notice, warning, error
#silent_mode = false

log_connections = true
log_pid = true
log_statement = true
log_duration = true
log_timestamp = true

#log_min_error_statement = panic # Values in order of increasing severity:
                                 #   debug5, debug4, debug3, debug2, debug1,
                                 #   info, notice, warning, error, panic(off)

#debug_print_parse = false
#debug_print_rewritten = false
#debug_print_plan = false
#debug_pretty_print = false

#explain_pretty_print = true

# requires USE_ASSERT_CHECKING
#debug_assertions = true

#
# Syslog
#
#syslog = 0                     # range 0-2
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'

#
# Statistics
#
#show_parser_stats = false
#show_planner_stats = false
#show_executor_stats = false
#show_statement_stats = false

# requires BTREE_BUILD_STATS
#show_btree_build_stats = false

#
# Access statistics collection
#
#stats_start_collector = true
#stats_reset_on_server_start = true

#stats_command_string = false
#stats_row_level = false
#stats_block_level = false

#
# Lock Tracing
#
#trace_notify = false

# requires LOCK_DEBUG
#trace_locks = false
#trace_userlocks = false
#trace_lwlocks = false
#debug_deadlocks = false
#trace_lock_oidmin = 16384
#trace_lock_table = 0

#
# Misc
#
#autocommit = true
#dynamic_library_path = '$libdir'
#search_path = '$user,public'
#datestyle = 'iso, us'
#timezone = unknown             # actually, defaults to TZ environment setting
#australian_timezones = false
#client_encoding = sql_ascii    # actually, defaults to database encoding
#authentication_timeout = 60    # 1-600, in seconds
#deadlock_timeout = 1000        # in milliseconds
#default_transaction_isolation = 'read committed'
#max_expr_depth = 10000         # min 10
#max_files_per_process = 1000   # min 25
#password_encryption = true
#sql_inheritance = true
#transform_null_equals = false
#statement_timeout = 0          # 0 is disabled, in milliseconds
#db_user_namespace = false

#
# Locale settings
#
# (initialized by initdb -- may be changed)
LC_MESSAGES = 'C'
LC_MONETARY = 'C'
LC_NUMERIC = 'C'
LC_TIME = 'C'
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

This bit is simple and probably wants leaving alone, except for very specific changes as you need them:

> # Connection Parameters
> #
> tcpip_socket = true
> #ssl = false
>
> #max_connections = 32
> #superuser_reserved_connections = 2
>
> port = 5432
> [...]

The next two sections are most important:

> #
> # Shared Memory Size
> #
> #shared_buffers = 64            # min max_connections*2 or 16, 8KB each
> #max_fsm_relations = 1000       # min 10, fsm is free space map, ~40 bytes
> #max_fsm_pages = 10000          # min 1000, fsm is free space map, ~6 bytes
> #max_locks_per_transaction = 64 # min 10
> #wal_buffers = 8                # min 4, typically 8KB each

These probably want upping, quite a lot, for your system: max_fsm_relations at least 5000, I'd say, max_fsm_pages 50000, and shared_buffers can be made to be huge.

> #
> # Non-shared Memory Sizes
> #
> #sort_mem = 1024                # min 64, size in KB
> #vacuum_mem = 8192              # min 1024, size in KB

These want upping, quite a lot. Like:

sort_mem = 65536
vacuum_mem = 32768

are what I use, and I have a much lower-powered box.

> #
> # Write-ahead log (WAL)
> #
> #checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
> #checkpoint_timeout = 300       # range 30-3600, in seconds
> #
> #commit_delay = 0               # range 0-100000, in microseconds
> #commit_siblings = 5            # range 1-1000
> #
> #fsync = true
> #wal_sync_method = fsync        # the default varies across platforms:
>                                 # fsync, fdatasync, open_sync, or open_datasync
> #wal_debug = 0                  # range 0-16

Less sure about that lot.

The next section is only used by the query planner; debug it if you have silly plans being generated. You probably want to change random_page_cost.

> #
> # Optimizer Parameters
> #
> #enable_seqscan = true
> #enable_indexscan = true
> #enable_tidscan = true
> #enable_sort = true
> #enable_nestloop = true
> #enable_mergejoin = true
> #enable_hashjoin = true
>
> #effective_cache_size = 1000    # typically 8KB each
> #random_page_cost = 4           # units are one sequential page fetch cost
> #cpu_tuple_cost = 0.01          # (same)
> #cpu_index_tuple_cost = 0.001   # (same)
> #cpu_operator_cost = 0.0025     # (same)
>
> #default_statistics_target = 10 # range 1-1000
>
> # GEQO Optimizer Parameters
> [...]

A lot of system stuff now, scroll down...
> #
> # Message display
> #
> [...]

Some useful stuff in Misc: deadlock_timeout probably wants increasing, for safety's sake. Other things can be played with.

> #
> # Misc
> #
> #autocommit = true
> #dynamic_library_path = '$libdir'
> #search_path = '$user,public'
> #datestyle = 'iso, us'
> #timezone = unknown             # actually, defaults to TZ environment setting
> #australian_timezones = false
> #client_encoding = sql_ascii    # actually, defaults to database encoding
> #authentication_timeout = 60    # 1-600, in seconds
> #deadlock_timeout = 1000        # in milliseconds
> #default_transaction_isolation = 'read committed'
> #max_expr_depth = 10000         # min 10
> #max_files_per_process = 1000   # min 25
> #password_encryption = true
> #sql_inheritance = true
> #transform_null_equals = false
> #statement_timeout = 0          # 0 is disabled, in milliseconds
> #db_user_namespace = false
>
> #
> # Locale settings
> #
> # (initialized by initdb -- may be changed)
> LC_MESSAGES = 'C'
> LC_MONETARY = 'C'
> LC_NUMERIC = 'C'
> LC_TIME = 'C'

Hope some of that helps.

--
Sam Barnett-Cormack
Software Developer                           |  Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk)  |  Lancaster University
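Pulling Sam's suggestions together, a sketch of what the changed lines in postgresql.conf might look like on this 2 GB box. Only max_fsm_relations, max_fsm_pages, sort_mem, and vacuum_mem are figures quoted in the thread; the other values are illustrative starting points, not recommendations from the thread:

    shared_buffers = 32768          # 8KB pages, i.e. 256MB ("can be made to be huge"); illustrative
    max_fsm_relations = 5000        # Sam: at least 5000
    max_fsm_pages = 50000           # Sam: 50000
    sort_mem = 65536                # KB per sort - Sam's own value; watch concurrent sorts
    vacuum_mem = 32768              # KB - Sam's own value
    effective_cache_size = 131072   # 8KB pages, ~1GB of expected OS cache; illustrative
    random_page_cost = 2            # Sam suggests changing it; 2 is an illustrative value

The server needs a restart for shared_buffers to take effect, and the kernel's SHMMAX may have to be raised to allow the larger shared memory segment.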
On Mon, Sep 08, 2003 at 10:32:51 +0300, Vasilis Ventirozos <vendi@cosmoline.com> wrote:

> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)? I am expecting 2 billion
> records in 2004, so I have to do something.
>
> My hardware is good, my indexes are good, so please don't answer with
> something like "use vacuum" :)

Well, if you want specific answers, you need to give us some specific data. For example, EXPLAIN ANALYZE output for the slow queries and the relevant table definitions.
On Mon, Sep 08, 2003 at 13:26:05 +0300, Vasilis Ventirozos <vendi@cosmoline.com> wrote:

> This is a simple statement that I run:
>
> core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
>                                        QUERY PLAN
> ---------------------------------------------------------------------------------------
>  Aggregate  (cost=2057275.91..2130712.22 rows=979151 width=4)
>    ->  Group  (cost=2057275.91..2106233.45 rows=9791508 width=4)
>          ->  Sort  (cost=2057275.91..2081754.68 rows=9791508 width=4)
>                Sort Key: spcode
>                ->  Seq Scan on callticket  (cost=0.00..424310.08 rows=9791508 width=4)
> (5 rows)

In addition to making the changes to the config file suggested in other responses, you may also want to do some testing with the 7.4 beta. Hash aggregates will most likely speed this query up a lot (assuming there aren't millions of unique spcodes). The production release of 7.4 will probably happen in about a month.
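To sketch what that means: with hash aggregation (new in 7.4) the planner can replace the Sort/Group/Aggregate stack with a single HashAggregate over the seq scan, so the 9.8 million rows are bucketed by spcode without being sorted first. The plan below only illustrates the shape - it is not real output from this database:

    core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
     HashAggregate  (cost=... rows=... width=4)
       ->  Seq Scan on callticket  (cost=... rows=9791508 width=4)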
"Donald Fraser" <demolish@cwgsy.net> writes: > My analysis at the time was that to access random records, performance > deteriorated the further away the records that you were accessing were > from the beginning of the index. For example using a query that had > say OFFSET 250000 would cause large delays. Well, yeah. OFFSET implies generating and discarding that number of records. AFAICS there isn't any shortcut for this, even in a query that's just an indexscan, since the index alone can't tell us whether any given record would actually be returned. regards, tom lane
> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)?

I have worked on a data warehouse (PostgreSQL 7.3) with a pretty standard star schema: over 250 million rows in the central 'fact' table, and anywhere from 100 to 10+ million records in the surrounding 'dimension' tables.

The most common queries were simple joins between 3 tables, with selects on one of the ids. These took a few (1-60) seconds.

About 500,000 new records were loaded each night; the ETL processing and creating some aggregates took about 11 hours/night with 7.3, and 9 hours/night with 7.4beta.

Hope this helps.
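For readers who haven't seen a star schema, a minimal sketch of the kind of three-table join described above (all table and column names are made up for illustration; they are not from the poster's warehouse):

    -- Fact table keyed by surrogate ids into the dimension tables.
    SELECT d.month, c.region, sum(f.duration) AS total_duration
    FROM call_fact f
    JOIN date_dim d ON d.date_id = f.date_id
    JOIN customer_dim c ON c.customer_id = f.customer_id
    WHERE c.customer_id = 42          -- select on one of the ids
    GROUP BY d.month, c.region;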
I have a US zip code look-up table that holds, for each origin zip code, all zip codes within a 300 mile radius.
DATE_PORTAL=# select count(origin) from zips_300 ;
count
-----------
201021979
(1 row)
The row count is shown above, and I get great, great performance. However, with large data sets you need a lot of space for indexes; in my production system, I moved the indexes to another disk.
PostgreSQL can handle 50 million rows with no problem - just index the data properly.
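As a concrete sketch of "index the data properly" for a look-up table like the one above - zips_300 and origin are from the post, while the index name and the sample zip code are made up:

    -- One b-tree index on the column the look-ups filter on.
    CREATE INDEX zips_300_origin_idx ON zips_300 (origin);
    ANALYZE zips_300;

    -- A radius look-up for a single origin now reads only the matching slice
    -- of the index instead of scanning all 201 million rows.
    SELECT * FROM zips_300 WHERE origin = '90210';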