Thread: Why so slow?

Why so slow?

From: Alessandro Gagliardi
Comparing 
SELECT DISTINCT(user_id) FROM blocks JOIN seen_its USING (user_id) WHERE seen_its.created BETWEEN (now()::date - interval '8 days')::timestamp AND now()::date::timestamp
to
SELECT DISTINCT(user_id) FROM seen_its WHERE created BETWEEN (now()::date - interval '8 days')::timestamp AND now()::date::timestamp
the difference is 100x. 

Here are my tables:

CREATE TABLE seen_its (
  user_id character(24) NOT NULL,
  moment_id character(24) NOT NULL,
  created timestamp without time zone,
  inserted timestamp without time zone DEFAULT now(),
  CONSTRAINT seen_its_pkey PRIMARY KEY (user_id, moment_id)
) WITH (OIDS=FALSE);

CREATE INDEX seen_its_created_idx ON seen_its USING btree (created);

CREATE INDEX seen_its_user_id_idx ON seen_its USING btree (user_id);

CREATE TABLE blocks (
  block_id character(24) NOT NULL,
  user_id character(24) NOT NULL,
  created timestamp with time zone,
  locale character varying,
  shared boolean,
  private boolean,
  moment_type character varying NOT NULL,
  user_agent character varying,
  inserted timestamp without time zone NOT NULL DEFAULT now(),
  networks character varying[],
  lnglat point,
  timezone character varying,
  geohash character varying(20),
  CONSTRAINT blocks_pkey PRIMARY KEY (block_id)
) WITH (OIDS=FALSE);

CREATE INDEX blocks_created_at_timezone_idx ON blocks USING btree (timezone(timezone::text, created));

CREATE INDEX blocks_created_idx ON blocks USING btree (created DESC NULLS LAST);
CREATE INDEX blocks_geohash_idx ON blocks USING btree (geohash);
CREATE INDEX blocks_timezone_idx ON blocks USING btree (timezone);
CREATE INDEX blocks_user_id_idx ON blocks USING btree (user_id);

My blocks table has about 17M rows in it. My seen_its table has 1.9M rows in it (though that is expected to grow into the billions). 

Here is the EXPLAIN: http://explain.depesz.com/s/ley

I'm using PostgreSQL 9.0.6 on i486-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 32-bit

My random_page_cost is 2, and yet it still insists on using a Seq Scan on blocks.
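
For reference, one way to check whether the Seq Scan is purely a cost-based choice is to penalize it for a single transaction and look at the plan again. This is a diagnostic sketch only (assuming psql access); enable_seqscan is not something to leave off in production:

BEGIN;
-- Discourage sequential scans for this transaction only, to see whether
-- the planner is able to use blocks_user_id_idx at all.
SET LOCAL enable_seqscan = off;
EXPLAIN ANALYZE
SELECT DISTINCT(user_id)
FROM blocks JOIN seen_its USING (user_id)
WHERE seen_its.created BETWEEN (now()::date - interval '8 days')::timestamp
                           AND now()::date::timestamp;
ROLLBACK;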

Whenever I use my blocks table, this seems to happen. I'm not sure what's wrong. 

Any help would be much appreciated.

Thank you,
-Alessandro

Re: Why so slow?

From: Steve Crawford
On 02/17/2012 10:34 AM, Alessandro Gagliardi wrote:
> Comparing
> SELECT DISTINCT(user_id) FROM blocks JOIN seen_its USING (user_id)
> WHERE seen_its.created BETWEEN (now()::date - interval '8
> days')::timestamp AND now()::date::timestamp
> to
> SELECT DISTINCT(user_id) FROM seen_its WHERE created BETWEEN
> (now()::date - interval '8 days')::timestamp AND now()::date::timestamp
> the difference is 100x.
> ...
Though I could figure it out, it would be helpful to actually specify
which query is faster and to post the explain of *both* queries.

But in general, it is not terribly unusual to find that rewriting a
query can lead the planner to generate a superior plan. Trying and
testing different ways of writing a query is a standard tuning technique.

There are also version-specific issues, with some versions of PostgreSQL
preferring "...where foo in (select..." and others preferring
"...where exists (select...".
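
For example, the same restriction can be phrased either way (illustrative only, using the tables from the original post; which form plans better depends on the version and the data):

-- IN form
SELECT DISTINCT user_id
FROM seen_its
WHERE user_id IN (SELECT user_id FROM blocks)
  AND created BETWEEN (now()::date - interval '8 days')::timestamp
                  AND now()::date::timestamp;

-- EXISTS form
SELECT DISTINCT user_id
FROM seen_its s
WHERE EXISTS (SELECT 1 FROM blocks b WHERE b.user_id = s.user_id)
  AND s.created BETWEEN (now()::date - interval '8 days')::timestamp
                    AND now()::date::timestamp;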

If you are planning to ramp up to high volumes it is also *very*
important to test and tune using the size of database you plan to have
on the hardware you will use in production. You cannot extrapolate from
a dev database on an i486 (?!?) machine to a production server with more
spindles, different RAID setup, different CPU, more cores, vastly more
memory, etc.

In the case of your queries, the second one eliminates a join and gives
the planner an easy way to optimize using the available indexes so I'm
not surprised it's faster.

Note: I am guessing that your seen_its table just grows and grows but is
rarely, if ever, modified. If it is basically a log-type table it will
be a prime candidate for partitioning on date, and queries like this will
only need to access a couple of relatively small child tables instead of
one massive one.
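
As a rough sketch of what that could look like on 9.0 (inheritance-based partitioning; the child table name and date range below are made up purely for illustration):

-- One child per month; the parent seen_its stays (mostly) empty.
CREATE TABLE seen_its_2012_02 (
  CHECK (created >= DATE '2012-02-01' AND created < DATE '2012-03-01')
) INHERITS (seen_its);

-- Indexes are not inherited, so each child needs its own.
CREATE INDEX seen_its_2012_02_created_idx ON seen_its_2012_02 (created);
CREATE INDEX seen_its_2012_02_user_id_idx ON seen_its_2012_02 (user_id);

-- New rows must be routed to the right child (via a trigger on the parent
-- or by the application). With constraint_exclusion = partition (the
-- default), a query filtered on created only scans the matching children.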

Cheers,
Steve


Re: Why so slow?

From: Alessandro Gagliardi
Your guess about the seen_its table growing is accurate and applies to the blocks table as well. Partitioning on date is probably a good idea and something I've been meaning to investigate. I'm not surprised that the JOIN makes it slower; I'm surprised by the magnitude of how much slower it is.

This is my analytics database (not dev), so no extrapolation is necessary, except that I know the tables will grow in size. The database is hosted on AWS and maintained by Heroku.


Re: Why so slow?

From: Ants Aasma

On Feb 17, 2012 8:35 PM, "Alessandro Gagliardi" <alessandro@path.com> wrote:
> Here is the EXPLAIN: http://explain.depesz.com/s/ley
>
> I'm using PostgreSQL 9.0.6 on i486-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 32-bit
>
> My random_page_cost is 2 and yet it still insists on using Seq Scan on blocks.

As could be inferred from the row counts, it's slow because it's joining and then aggregating a quarter of the blocks table. The hash join with its sequential scan is probably the correct choice for that type of join; it's the join itself that should be optimized out. The optimizer doesn't figure out that the join can be turned into a semi-join if the output is aggregated with DISTINCT and comes from only one of the tables (in this case, because the output is the join key, it can come from either table).

To make the optimizer's job easier you can rewrite it as a semi-join explicitly:
SELECT DISTINCT(user_id) FROM seen_its WHERE EXISTS (SELECT 1 FROM blocks WHERE blocks.user_id = seen_its.user_id) AND seen_its.created BETWEEN (now()::date - interval '8 days')::timestamp AND now()::date::timestamp

--
Ants Aasma

Re: Why so slow?

From: Alessandro Gagliardi
Ah, that did make a big difference! It went from taking 10x as long to taking only 1.5x as long (about what I would have expected, if not better). Thank you!
