Re: Optimize update query - Mailing list pgsql-performance

From Niels Kristian Schjødt
Subject Re: Optimize update query
Date
Msg-id CAC9DA03-64C2-45E2-9B88-703F9768F574@autouncle.com
In response to Re: Optimize update query  (Willem Leenen <willem_leenen@hotmail.com>)
Responses Re: Optimize update query
Re: Optimize update query
List pgsql-performance
Okay guys,

Thanks for all the great help and advice already! Let me clear up a few things to make my question a little easier to answer :-)
My site is a search engine for used cars - not just a car shop with a few hundred cars.
The update query you are looking at is executed once a day, in chunks, for all active adverts, so we know they are still for sale (one car can be advertised in several places, hence several "adverts"). So it's not a "constant stream", but it does have a fairly high volume, especially at night.
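
For context, this is roughly the shape of one chunk. Just a sketch: the 'active' state value, the chunk size of 1000, and setting only last_observed_at are illustrative stand-ins for what the application actually supplies (the real query also sets data_source_id):

  -- One chunk of the nightly "still for sale" pass (illustrative values:
  -- the id set normally comes from the application, not a subselect).
  UPDATE "adverts"
  SET    "last_observed_at" = now()
  WHERE  "adverts"."id" IN (
           SELECT "id"
           FROM   "adverts"
           WHERE  "availability_state" = 'active'
           LIMIT  1000 );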

A compressed version of my .conf looks like this (note: there are some tweaks at the end of the file):
  data_directory = '/var/lib/postgresql/9.2/main'
  hba_file = '/etc/postgresql/9.2/main/pg_hba.conf' 
  ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
  external_pid_file = '/var/run/postgresql/9.2-main.pid' 
  listen_addresses = '192.168.0.2, localhost'
  port = 5432
  max_connections = 1000 
  unix_socket_directory = '/var/run/postgresql'
  wal_level = hot_standby
  synchronous_commit = off
  archive_mode = on
  archive_command = 'rsync -a %p postgres@192.168.0.4:/var/lib/postgresql/9.2/wals/%f </dev/null'
  max_wal_senders = 1 
  wal_keep_segments = 32
  logging_collector = on 
  log_min_messages = debug1 
  log_min_error_statement = debug1
  log_min_duration_statement = 0
  log_checkpoints = on
  log_connections = on
  log_disconnections = on
  log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
  log_lock_waits = on
  log_temp_files = 0
  datestyle = 'iso, mdy' 
  lc_messages = 'C'
  lc_monetary = 'en_US.UTF-8'
  lc_numeric = 'en_US.UTF-8' 
  lc_time = 'en_US.UTF-8' 
  default_text_search_config = 'pg_catalog.english' 
  default_statistics_target = 100
  maintenance_work_mem = 1GB
  checkpoint_completion_target = 0.7
  effective_cache_size = 22GB
  work_mem = 160MB
  wal_buffers = 4MB
  checkpoint_segments = 16
  shared_buffers = 7680MB

# All the log stuff is mainly temporary requirement for pgBadger
# The database has been tuned with pgtune
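
If you want to double-check what the running server actually uses, as opposed to what the file says, pg_settings shows the live values; a quick sanity check:

  -- Show the live values of the settings most relevant to write load.
  SELECT name, setting, unit
  FROM   pg_settings
  WHERE  name IN ( 'shared_buffers', 'work_mem', 'maintenance_work_mem',
                   'checkpoint_segments', 'checkpoint_completion_target',
                   'synchronous_commit', 'wal_buffers' );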

You might be familiar with New Relic; I use it for quite a lot of monitoring. This is what I see at night time (a lot of I/O), so I went to play around with pgBadger to get some insight at the database level.
[New Relic I/O chart: https://rpm.newrelic.com/public/charts/h2dtedghfsv]

This shows me that by far the most time-consuming queries are updates (in general). On average, a query like the one I showed you takes 1.3 s, but it often takes several minutes, which makes me wonder. So correct me if I'm wrong here: my theory is that I have too many slow update queries which then often end up waiting for each other to finish, hence the sometimes VERY long execution times. So my basic idea is that if I could reduce the cost of the updates, I could get a higher throughput overall.
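
To check whether that "waiting for each other" theory holds, I plan to run something like the following while the night batch is going. This is a sketch adapted from the lock-monitoring example on the PostgreSQL wiki (simplified: it matches waiters to holders on relation/transactionid only); on 9.2, pg_stat_activity still has the boolean "waiting" column:

  -- List blocked backends together with the backend holding the lock.
  SELECT blocked.pid    AS blocked_pid,
         blocked.query  AS blocked_query,
         blocking.pid   AS blocking_pid,
         blocking.query AS blocking_query
  FROM   pg_stat_activity blocked
  JOIN   pg_locks bl ON bl.pid = blocked.pid AND NOT bl.granted
  JOIN   pg_locks kl ON kl.granted
                    AND kl.locktype = bl.locktype
                    AND kl.database IS NOT DISTINCT FROM bl.database
                    AND kl.relation IS NOT DISTINCT FROM bl.relation
                    AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid
                    AND kl.pid <> bl.pid
  JOIN   pg_stat_activity blocking ON blocking.pid = kl.pid
  WHERE  blocked.waiting;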

Here is a sample of the pgBadger analysis:

Queries that took up the most time (rank, total duration, times executed, avg. duration):

  1.  total 1d15h28m38.71s, 948,711 executions, 0.15s avg.
      COMMIT;

  2.  total 1d2h17m55.43s, 401,002 executions, 0.24s avg.
      INSERT INTO "car_images" ( "car_id", "created_at", "image", "updated_at" ) VALUES ( '', '', '', '' ) returning "id";

  3.  total 23h18m33.68s, 195,093 executions, 0.43s avg.
      SELECT DISTINCT "cars".id FROM "cars" LEFT OUTER JOIN "adverts" ON "adverts"."car_id" = "cars"."id" LEFT OUTER JOIN "sellers" ON "sellers"."id" = "adverts"."seller_id" WHERE "cars"."sales_state" = '' AND "cars"."year" = 0 AND "cars"."engine_size" = 0.0 AND ( ( "cars"."id" IS NOT NULL AND "cars"."brand" = '' AND "cars"."model_name" = '' AND "cars"."fuel" = '' AND "cars"."km" = 0 AND "cars"."price" = 0 AND "sellers"."kind" = '' ) ) LIMIT 0;

  4.  total 22h45m26.52s, 3,374,133 executions, 0.02s avg.
      SELECT "adverts".* FROM "adverts" WHERE ( source_name = '' AND md5 ( url ) = md5 ( '' ) ) LIMIT 0;

  5.  total 10h31m37.18s, 29,671 executions, 1.28s avg.
      UPDATE "adverts" SET "last_observed_at" = '', "data_source_id" = '' WHERE "adverts"."id" IN ( ... ) ;

  6.  total 7h18m40.65s, 396,393 executions, 0.07s avg.
      UPDATE "cars" SET "updated_at" = '' WHERE "cars"."id" = 0;

  7.  total 7h6m7.87s, 241,294 executions, 0.11s avg.
      UPDATE "cars" SET "images_count" = COALESCE ( "images_count", 0 ) + 0 WHERE "cars"."id" = 0;

  8.  total 6h56m11.78s, 84,571 executions, 0.30s avg.
      INSERT INTO "failed_adverts" ( "active_record_object_class", "advert_candidate", "created_at", "exception_class", "exception_message", "from_rescraper", "last_retried_at", "retry_count", "source_name", "stack_trace", "updated_at", "url" ) VALUES ( NULL, '', '', '', '', NULL, NULL, '', '', '', '', '' ) returning "id";

  9.  total 5h47m25.45s, 188,402 executions, 0.11s avg.
      INSERT INTO "adverts" ( "availability_state", "car_id", "created_at", "data_source_id", "deactivated_at", "first_extraction", "last_observed_at", "price", "seller_id", "source_id", "source_name", "updated_at", "url" ) VALUES ( '', '', '', '', NULL, '', '', '', '', '', '', '', '' ) returning "id";

  10. total 3h4m26.86s, 166,235 executions, 0.07s avg.
      UPDATE "adverts" SET "deactivated_at" = '', "availability_state" = '', "updated_at" = '' WHERE "adverts"."id" = 0;

(Yes I'm already on the task of improving the selects)
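
For the rank 4 lookup, for example, one idea on my list (just a sketch, the index name is made up, and I haven't deployed it) is an expression index, so the md5(url) comparison can be answered from an index instead of a scan:

  -- md5() is immutable, so it is allowed in an expression index.
  -- CONCURRENTLY avoids holding a write-blocking lock during the build.
  CREATE INDEX CONCURRENTLY adverts_source_name_md5_url_idx
      ON "adverts" ( source_name, md5(url) );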

On 28/11/2012, at 16:11, Willem Leenen <willem_leenen@hotmail.com> wrote:


I assume that SQL databases (banks? telecom?) can handle a used car shop. No need for an unstructured data tool.



> +1. SQL databases have a limited number of inserts/updates per second. Even
> with high-end hardware you won't get more than XXX operations per
> second. As Thomas said, you should feed something like a NoSQL database
> from the www server and use another tool to do aggregation and batch inserts
> to PostgreSQL. It will scale much better.
> 
> Marcin
