Thread: Are 50 million rows a problem for Postgres?
Hi all, I work in a telco and I have a huge amount of data (50 million rows), but I see a lack of performance on huge tables with Postgres. Are 50 million rows the "limit" of Postgres (with good performance)? I am expecting 2 billion records in 2004, so I have to do something. Does anyone here run a huge database whom I could ask a few questions?

My hardware is good, my indexes are good, so please don't answer with something like "use vacuum" :)

The Joy of Index

Vasilis Ventirozos
--------------------
Could you give more detailed information? What does EXPLAIN say?

On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)? I am expecting 2 billion
> records in 2004, so I have to do something. Does anyone here run a huge
> database whom I could ask a few questions?
>
> My hardware is good, my indexes are good, so please don't answer with
> something like "use vacuum" :)
It's not a standard statement; I am trying to gather statistics for the company and I see a lack of performance (the same statement on Informix runs fine, with the same indexes of course).

Vasilis Ventirozos
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)? I am expecting 2 billion
> records in 2004, so I have to do something.

I have a similarly huge number of records, as I process our web, ftp, and rsync logs together using Postgres. Works like a charm. You do have to allow that queries are going to take a long time.

I use about 6 queries to summarise a quarter's data - each run for each month, so a total of 18 queries. These run in a little over 24 hours. And there are many, many records per month.

--
Sam Barnett-Cormack
Software Developer                           |  Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk)  |  Lancaster University
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)? I am expecting 2 billion
> records in 2004, so I have to do something.

I did some performance testing back on PostgreSQL version 7.2 on a table of 350,000 records. My analysis at the time was that for accessing random records, performance deteriorated the further the records you were accessing were from the beginning of the index. For example, a query with say OFFSET 250000 would cause large delays. On a text index these delays were in the order of 60 seconds, which was unacceptable for the application I was developing.

To overcome this problem I had to do away with queries that used OFFSET and developed queries that accessed the data relative to previously accessed records. The only drawback was that I had to make the indexes unique by appending a unique id column as the last column in the index. I now have no problem scanning a table in excess of 1 million records in small chunks at a time.

Anyway, without seeing what sort of query you are having problems with, nobody on these email lists will be able to fully help you. At a minimum we need to see an SQL statement and the results of EXPLAIN.

Regards
Donald Fraser.
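For illustration, a minimal sketch of the chunked, keyset-style access Donald describes, using the callticket table mentioned later in the thread (the callid key column and the literal 123456 are hypothetical - substitute whatever unique column the real index ends with):

    -- OFFSET gets slower the deeper you go: every skipped row is still
    -- generated and then thrown away.
    SELECT callid, spcode
    FROM callticket
    ORDER BY callid
    LIMIT 100 OFFSET 250000;

    -- Keyset-style equivalent: remember the last callid of the previous
    -- chunk and restart just after it; a b-tree index on callid keeps this
    -- cheap no matter how deep into the table you are.
    SELECT callid, spcode
    FROM callticket
    WHERE callid > 123456        -- last callid seen in the previous chunk
    ORDER BY callid
    LIMIT 100;

Appending a unique id column as the last column of the index, as Donald did, is what makes the ordering unambiguous so chunks neither skip nor repeat rows.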
This is a simple statement that I run:

core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
                                       QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=2057275.91..2130712.22 rows=979151 width=4)
   ->  Group  (cost=2057275.91..2106233.45 rows=9791508 width=4)
         ->  Sort  (cost=2057275.91..2081754.68 rows=9791508 width=4)
               Sort Key: spcode
               ->  Seq Scan on callticket  (cost=0.00..424310.08 rows=9791508 width=4)
(5 rows)
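A natural follow-up here, assuming the server is 7.2 or later, is EXPLAIN ANALYZE: it actually executes the statement and reports per-node times and row counts alongside the planner's estimates, which would show whether the time goes into the seq scan or the sort. A sketch of the invocation only - output is omitted since it depends on the data:

    core_netfon=# EXPLAIN ANALYZE select spcode,count(*) from callticket group by spcode;

Because it really runs the query, on a 9.8-million-row table it will take as long as the query itself.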
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> This is a simple statement that I run:
>
> core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;

Well, yeah. Whatever you do, a complete seqscan and count of a table this size is going to take a long time - in the order of hours rather than days, I would expect. However, you may want to ensure that all tuning in postgresql.conf is correct, as it may not be using all possible resources. That will probably only make a small difference.

--
Sam Barnett-Cormack
Software Developer                           |  Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk)  |  Lancaster University
"Vasilis Ventirozos" <vendi@cosmoline.com> wrote: > This is a simple statement that i run > > core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode; > QUERY PLAN > -------------------------------------------------------------------------- ------------- > Aggregate (cost=2057275.91..2130712.22 rows=979151 width=4) > -> Group (cost=2057275.91..2106233.45 rows=9791508 width=4) > -> Sort (cost=2057275.91..2081754.68 rows=9791508 width=4) > Sort Key: spcode > -> Seq Scan on callticket (cost=0.00..424310.08 rows=9791508 > width=4) > (5 rows) May we see your configuration file ? Regards Gaetano Mendola
I use the default configuration file, with TCP/IP enabled. Any suggestion about the configuration file?
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

> I use the default configuration file, with TCP/IP enabled.
> Any suggestion about the configuration file?

Post a copy of it to the list, along with the specs of the machine it is running on, and I'm sure we'll all pipe in.

--
Sam Barnett-Cormack
Software Developer                           |  Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk)  |  Lancaster University
The server is a dual Xeon 2.4 GHz HP with a 15k rpm SCSI disk and 2 GB of RAM.

#
# Connection Parameters
#
tcpip_socket = true
#ssl = false

#max_connections = 32
#superuser_reserved_connections = 2

port = 5432
#hostname_lookup = false
#show_source_port = false

#unix_socket_directory = ''
#unix_socket_group = ''
#unix_socket_permissions = 0777 # octal

#virtual_host = ''

#krb_server_keyfile = ''

#
# Shared Memory Size
#
#shared_buffers = 64            # min max_connections*2 or 16, 8KB each
#max_fsm_relations = 1000       # min 10, fsm is free space map, ~40 bytes
#max_fsm_pages = 10000          # min 1000, fsm is free space map, ~6 bytes
#max_locks_per_transaction = 64 # min 10
#wal_buffers = 8                # min 4, typically 8KB each

#
# Non-shared Memory Sizes
#
#sort_mem = 1024                # min 64, size in KB
#vacuum_mem = 8192              # min 1024, size in KB

#
# Write-ahead log (WAL)
#
#checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300       # range 30-3600, in seconds
#
#commit_delay = 0               # range 0-100000, in microseconds
#commit_siblings = 5            # range 1-1000
#
#fsync = true
#wal_sync_method = fsync        # the default varies across platforms:
                                # fsync, fdatasync, open_sync, or open_datasync
#wal_debug = 0                  # range 0-16

#
# Optimizer Parameters
#
#enable_seqscan = true
#enable_indexscan = true
#enable_tidscan = true
#enable_sort = true
#enable_nestloop = true
#enable_mergejoin = true
#enable_hashjoin = true

#effective_cache_size = 1000    # typically 8KB each
#random_page_cost = 4           # units are one sequential page fetch cost
#cpu_tuple_cost = 0.01          # (same)
#cpu_index_tuple_cost = 0.001   # (same)
#cpu_operator_cost = 0.0025     # (same)

#default_statistics_target = 10 # range 1-1000

#
# GEQO Optimizer Parameters
#
#geqo = true
#geqo_selection_bias = 2.0      # range 1.5-2.0
#geqo_threshold = 11
#geqo_pool_size = 0             # default based on tables in statement,
                                # range 128-1024
#geqo_effort = 1
#geqo_generations = 0
#geqo_random_seed = -1          # auto-compute seed

#
# Message display
#
#server_min_messages = notice   # Values, in order of decreasing detail:
                                #   debug5, debug4, debug3, debug2, debug1,
                                #   info, notice, warning, error, log, fatal,
                                #   panic
#client_min_messages = notice   # Values, in order of decreasing detail:
                                #   debug5, debug4, debug3, debug2, debug1,
                                #   log, info, notice, warning, error
#silent_mode = false

log_connections = true
log_pid = true
log_statement = true
log_duration = true
log_timestamp = true

#log_min_error_statement = panic # Values in order of increasing severity:
                                 #   debug5, debug4, debug3, debug2, debug1,
                                 #   info, notice, warning, error, panic(off)

#debug_print_parse = false
#debug_print_rewritten = false
#debug_print_plan = false
#debug_pretty_print = false

#explain_pretty_print = true

# requires USE_ASSERT_CHECKING
#debug_assertions = true

#
# Syslog
#
#syslog = 0                     # range 0-2
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'

#
# Statistics
#
#show_parser_stats = false
#show_planner_stats = false
#show_executor_stats = false
#show_statement_stats = false

# requires BTREE_BUILD_STATS
#show_btree_build_stats = false

#
# Access statistics collection
#
#stats_start_collector = true
#stats_reset_on_server_start = true

#stats_command_string = false
#stats_row_level = false
#stats_block_level = false

#
# Lock Tracing
#
#trace_notify = false

# requires LOCK_DEBUG
#trace_locks = false
#trace_userlocks = false
#trace_lwlocks = false
#debug_deadlocks = false
#trace_lock_oidmin = 16384
#trace_lock_table = 0

#
# Misc
#
#autocommit = true
#dynamic_library_path = '$libdir'
#search_path = '$user,public'
#datestyle = 'iso, us'
#timezone = unknown             # actually, defaults to TZ environment setting
#australian_timezones = false
#client_encoding = sql_ascii    # actually, defaults to database encoding
#authentication_timeout = 60    # 1-600, in seconds
#deadlock_timeout = 1000        # in milliseconds
#default_transaction_isolation = 'read committed'
#max_expr_depth = 10000         # min 10
#max_files_per_process = 1000   # min 25
#password_encryption = true
#sql_inheritance = true
#transform_null_equals = false
#statement_timeout = 0          # 0 is disabled, in milliseconds
#db_user_namespace = false

#
# Locale settings
#
# (initialized by initdb -- may be changed)
LC_MESSAGES = 'C'
LC_MONETARY = 'C'
LC_NUMERIC = 'C'
LC_TIME = 'C'
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

This bit is simple and probably wants leaving alone, except for very specific changes as you need them:

> # Connection Parameters
> #
> tcpip_socket = true
> #ssl = false
>
> #max_connections = 32
> #superuser_reserved_connections = 2
>
> port = 5432
> [...]

The next two sections are most important:

> #
> # Shared Memory Size
> #
> #shared_buffers = 64            # min max_connections*2 or 16, 8KB each
> #max_fsm_relations = 1000       # min 10, fsm is free space map, ~40 bytes
> #max_fsm_pages = 10000          # min 1000, fsm is free space map, ~6 bytes
> #max_locks_per_transaction = 64 # min 10
> #wal_buffers = 8                # min 4, typically 8KB each

These probably want upping, quite a lot, for your system: max_fsm_relations at least 5000, I'd say, max_fsm_pages 50000, and shared_buffers can be made to be huge.

> #
> # Non-shared Memory Sizes
> #
> #sort_mem = 1024                # min 64, size in KB
> #vacuum_mem = 8192              # min 1024, size in KB

These want upping, quite a lot. Like:

sort_mem = 65536
vacuum_mem = 32768

are what I use, and I have a much lower-powered box.

> #
> # Write-ahead log (WAL)
> #
> #checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
> #checkpoint_timeout = 300       # range 30-3600, in seconds
> #
> #commit_delay = 0               # range 0-100000, in microseconds
> #commit_siblings = 5            # range 1-1000
> #
> #fsync = true
> #wal_sync_method = fsync        # the default varies across platforms:
>                                 # fsync, fdatasync, open_sync, or open_datasync
> #wal_debug = 0                  # range 0-16

Less sure about that lot.

The next section is only used by the query planner; debug it if you have silly plans being generated. You probably want to change random_page_cost.

> #
> # Optimizer Parameters
> #
> #enable_seqscan = true
> #enable_indexscan = true
> #enable_tidscan = true
> #enable_sort = true
> #enable_nestloop = true
> #enable_mergejoin = true
> #enable_hashjoin = true
>
> #effective_cache_size = 1000    # typically 8KB each
> #random_page_cost = 4           # units are one sequential page fetch cost
> #cpu_tuple_cost = 0.01          # (same)
> #cpu_index_tuple_cost = 0.001   # (same)
> #cpu_operator_cost = 0.0025     # (same)
>
> #default_statistics_target = 10 # range 1-1000
>
> # GEQO Optimizer Parameters
> [...]

A lot of system stuff now, scroll down...
> #
> # Message display
> #
> [...]

Some useful stuff in Misc: deadlock_timeout probably wants increasing, for safety's sake. Other things can be played with.

> #
> # Misc
> #
> #autocommit = true
> #dynamic_library_path = '$libdir'
> #search_path = '$user,public'
> #datestyle = 'iso, us'
> #timezone = unknown             # actually, defaults to TZ environment setting
> #australian_timezones = false
> #client_encoding = sql_ascii    # actually, defaults to database encoding
> #authentication_timeout = 60    # 1-600, in seconds
> #deadlock_timeout = 1000        # in milliseconds
> #default_transaction_isolation = 'read committed'
> #max_expr_depth = 10000         # min 10
> #max_files_per_process = 1000   # min 25
> #password_encryption = true
> #sql_inheritance = true
> #transform_null_equals = false
> #statement_timeout = 0          # 0 is disabled, in milliseconds
> #db_user_namespace = false
>
> #
> # Locale settings
> #
> # (initialized by initdb -- may be changed)
> LC_MESSAGES = 'C'
> LC_MONETARY = 'C'
> LC_NUMERIC = 'C'
> LC_TIME = 'C'

Hope some of that helps.

--
Sam Barnett-Cormack
Software Developer                           |  Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk)  |  Lancaster University
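Pulling Sam's suggestions together, a sketch of what the changed lines in postgresql.conf might look like on this 2 GB box. Only max_fsm_relations, max_fsm_pages, sort_mem, and vacuum_mem are figures quoted in the thread; the other values are illustrative starting points, not recommendations from the thread:

    shared_buffers = 32768          # 8KB pages, i.e. 256MB ("can be made to be huge"); illustrative
    max_fsm_relations = 5000        # Sam: at least 5000
    max_fsm_pages = 50000           # Sam: 50000
    sort_mem = 65536                # KB per sort - Sam's own value; watch concurrent sorts
    vacuum_mem = 32768              # KB - Sam's own value
    effective_cache_size = 131072   # 8KB pages, ~1GB of expected OS cache; illustrative
    random_page_cost = 2            # Sam suggests changing it; 2 is an illustrative value

The server needs a restart for shared_buffers to take effect, and the kernel's SHMMAX may have to be raised to allow the larger shared memory segment.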
On Mon, Sep 08, 2003 at 10:32:51 +0300, Vasilis Ventirozos <vendi@cosmoline.com> wrote:

> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)? I am expecting 2 billion
> records in 2004, so I have to do something.
>
> My hardware is good, my indexes are good, so please don't answer with
> something like "use vacuum" :)

Well, if you want specific answers, you need to give us some specific data. For example, EXPLAIN ANALYZE output for the slow queries and the relevant table definitions.
On Mon, Sep 08, 2003 at 13:26:05 +0300, Vasilis Ventirozos <vendi@cosmoline.com> wrote:

> This is a simple statement that I run:
>
> core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
>                                        QUERY PLAN
> ---------------------------------------------------------------------------------------
>  Aggregate  (cost=2057275.91..2130712.22 rows=979151 width=4)
>    ->  Group  (cost=2057275.91..2106233.45 rows=9791508 width=4)
>          ->  Sort  (cost=2057275.91..2081754.68 rows=9791508 width=4)
>                Sort Key: spcode
>                ->  Seq Scan on callticket  (cost=0.00..424310.08 rows=9791508 width=4)
> (5 rows)

In addition to making the changes to the config file suggested in other responses, you may also want to do some testing with the 7.4 beta. Hash aggregates will most likely speed this query up a lot (assuming there aren't millions of unique spcodes). The production release of 7.4 will probably happen in about a month.
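To sketch what that means: with hash aggregation (new in 7.4) the planner can replace the Sort/Group/Aggregate stack with a single HashAggregate over the seq scan, so the 9.8 million rows are bucketed by spcode without being sorted first. The plan below only illustrates the shape - it is not real output from this database:

    core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
     HashAggregate  (cost=... rows=... width=4)
       ->  Seq Scan on callticket  (cost=... rows=9791508 width=4)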
"Donald Fraser" <demolish@cwgsy.net> writes: > My analysis at the time was that to access random records, performance > deteriorated the further away the records that you were accessing were > from the beginning of the index. For example using a query that had > say OFFSET 250000 would cause large delays. Well, yeah. OFFSET implies generating and discarding that number of records. AFAICS there isn't any shortcut for this, even in a query that's just an indexscan, since the index alone can't tell us whether any given record would actually be returned. regards, tom lane
> Hi all, I work in a telco and I have a huge amount of data (50 million rows),
> but I see a lack of performance on huge tables with Postgres. Are 50 million
> rows the "limit" of Postgres (with good performance)?

I have worked on a data warehouse (PostgreSQL 7.3) with a pretty standard star schema: over 250 million rows in the central 'fact' table, and anywhere from 100 to 10+ million records in the surrounding 'dimension' tables.

The most common queries were simple joins between 3 tables, with selects on one of the ids. These took a few (1-60) seconds.

About 500,000 new records were loaded each night; the ETL processing and creating some aggregates took about 11 hours/night with 7.3, and 9 hours/night with 7.4beta.

Hope this helps.
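For readers who haven't seen a star schema, a minimal sketch of the kind of three-table join described above (all table and column names are made up for illustration; they are not from the poster's warehouse):

    -- Fact table keyed by surrogate ids into the dimension tables.
    SELECT d.month, c.region, sum(f.duration) AS total_duration
    FROM call_fact f
    JOIN date_dim d ON d.date_id = f.date_id
    JOIN customer_dim c ON c.customer_id = f.customer_id
    WHERE c.customer_id = 42          -- select on one of the ids
    GROUP BY d.month, c.region;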
I have a US zip code look-up table that holds, for each origin zip code, all zip codes within a 300 mile radius.
DATE_PORTAL=# select count(origin) from zips_300 ;
count
-----------
201021979
(1 row)
The row count is shown above, and I get great, great performance. However, with large data sets you need a lot of space for indexes; in my production system, I moved the indexes to another disk.
PostgreSQL can handle 50 million rows with no problem - just index the data properly.
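As a concrete sketch of "index the data properly" for a look-up table like the one above - zips_300 and origin are from the post, while the index name and the sample zip code are made up:

    -- One b-tree index on the column the look-ups filter on.
    CREATE INDEX zips_300_origin_idx ON zips_300 (origin);
    ANALYZE zips_300;

    -- A radius look-up for a single origin now reads only the matching slice
    -- of the index instead of scanning all 201 million rows.
    SELECT * FROM zips_300 WHERE origin = '90210';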