Re: Fwd: Raster performance - Mailing list pgsql-general

From PT
Subject Re: Fwd: Raster performance
Date
Msg-id 20150529164435.e218889770d67346ecf72e38@potentialtech.com
Whole thread Raw
In response to Re: Fwd: Raster performance  (David Haynes II <dahaynes@umn.edu>)
List pgsql-general
On Fri, 29 May 2015 13:37:24 -0500
David Haynes II <dahaynes@umn.edu> wrote:
>
> Let me explain this a bit more clearly. The dataset (table) that we are
> using in this analysis is produced from a satellite image (tif) called
> MODIS. This image can be placed into PostgreSQL using the raster2pgsql
> command (http://postgis.net/docs/using_raster_dataman.html). The
> raster2pgsql command reads the tif image and creates a table that can be
> used with the PostGIS extensions. There is no difference between the
> modis_noout and modis, what has happened is that we have used the
> raster2pgsl -R flag in the conversion process. The -R flag creates metadata
> for the raster file. For example, the modis dataset as tif is 46 GB, the
> same dataset stored as table within the database is 1.74 GB and the using
> outdb the metadata table is 52 MB.

What is the output of:
SELECT pg_total_relation_size('modis') AS "modis",
       pg_total_relation_size('modis_noout') AS "modis_noout";
on both indb and outdb?

> We have seen a decrease in performance time when using OutDB, keep in mind
> that the schema and indices on the tables are exactly the same.
>
> CREATE TABLE modis (
>    rid integer NOT NULL,
>    rast public.raster,
>    filename text,
>    CONSTRAINT enforce_height_rast CHECK ((public.st_height(rast) = 250)),
>    CONSTRAINT enforce_nodata_values_rast CHECK
> (((public._raster_constraint_nodata_values(rast))::numeric(16,10)[] =
> '{NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL}'::numeric(16,10)[])),
>    CONSTRAINT enforce_num_bands_rast CHECK ((public.st_numbands(rast) =
> 12)),
>    CONSTRAINT enforce_out_db_rast CHECK
> ((public._raster_constraint_out_db(rast) =
> '{t,t,t,t,t,t,t,t,t,t,t,t}'::boolean[])),
>    CONSTRAINT enforce_pixel_types_rast CHECK
> ((public._raster_constraint_pixel_types(rast) =
>  '{8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI,8BUI}'::text[])),
>    CONSTRAINT enforce_same_alignment_rast CHECK
> (public.st_samealignment(rast,
>
'01000000002EC50BE300F57C4026C50BE300F57CC0E7FBA955801673C13202AA558016634100000000000000000000000000000000BA1A000001000100'::public.raster)
> ),
>    CONSTRAINT enforce_scalex_rast CHECK
> (((public.st_scalex(rast))::numeric(16,10) =
> 463.312716527917::numeric(16,10))),
>    CONSTRAINT enforce_scaley_rast CHECK
> (((public.st_scaley(rast))::numeric(16,10) =
> (-463.312716527917)::numeric(16,10))),
>    CONSTRAINT enforce_srid_rast CHECK ((public.st_srid(rast) = 6842)),
>    CONSTRAINT enforce_width_rast CHECK ((public.st_width(rast) = 250))
> );
> CREATE SEQUENCE modis_rid_seq
>    START WITH 1
>    INCREMENT BY 1
>    NO MINVALUE
>    NO MAXVALUE
>    CACHE 1;
> ALTER TABLE ONLY modis ALTER COLUMN rid SET DEFAULT
> nextval('modis_rid_seq'::regclass);
> ALTER TABLE ONLY modis ADD CONSTRAINT modis_pkey PRIMARY KEY (rid);
> CREATE INDEX modis_rast_gist ON modis USING gist
> (public.st_convexhull(rast));
>
>
> Here is the explain analyze
>
> InDB
> Sort  (cost=69547.29..69562.80 rows=6204 width=254) (actual
> time=131042.478..131042.877 rows=7612 loops=1)
>   Sort Key: r.id, r.name
>   Sort Method: quicksort  Memory: 787kB
>   CTE poly
>     ->  Seq Scan on us_counties  (cost=0.00..112.86 rows=3109 width=62247)
> (actual time=3.008..8135.679 rows=3109 loops=1)
>   CTE rast_select
>     ->  Nested Loop  (cost=0.28..51767.41 rows=62033 width=272) (actual
> time=170.799..128606.266 rows=7677 loops=1)
>           ->  CTE Scan on poly s  (cost=0.00..62.18 rows=3109 width=250)
> (actual time=3.058..8212.030 rows=3109 loops=1)
>           ->  Index Scan using modis_noout_rast_gist on modis_noout r_1
>  (cost=0.28..16.56 rows=2 width=22) (actual time=4.329..11.244 rows=2
> loops=3109)
>                 Index Cond: ((rast)::geometry && s.geom)
>                 Filter: _st_intersects(s.geom, rast, NULL::integer)
>                 Rows Removed by Filter: 0
>   ->  HashAggregate  (cost=17214.16..17276.20 rows=6204 width=254) (actual
> time=131033.902..131036.105 rows=7612 loops=1)
>         ->  CTE Scan on rast_select r  (cost=0.00..1240.66 rows=62033
> width=254) (actual time=170.814..128659.842 rows=7677 loops=1)
> Total runtime: 131414.752 ms
>
> OutDB
> Sort  (cost=93911.29..93926.80 rows=6204 width=254) (actual
> time=866326.762..866327.148 rows=7612 loops=1)
>   Sort Key: r.id, r.name
>   Sort Method: quicksort  Memory: 787kB
>   CTE poly
>     ->  Seq Scan on us_counties  (cost=0.00..112.86 rows=3109 width=62247)
> (actual time=1.327..6954.020 rows=3109 loops=1)
>   CTE rast_select
>     ->  Nested Loop  (cost=0.28..76131.41 rows=62033 width=1086) (actual
> time=257.610..863474.778 rows=7677 loops=1)
>           ->  CTE Scan on poly s  (cost=0.00..62.18 rows=3109 width=250)
> (actual time=1.341..7030.138 rows=3109 loops=1)
>           ->  Index Scan using modis_rast_gist on modis r_1
>  (cost=0.28..24.40 rows=2 width=836) (actual time=1.481..3.952 rows=2
> loops=3109)
>                 Index Cond: ((rast)::geometry && s.geom)
>                 Filter: _st_intersects(s.geom, rast, NULL::integer)
>                 Rows Removed by Filter: 0
>   ->  HashAggregate  (cost=17214.16..17276.20 rows=6204 width=254) (actual
> time=866317.923..866320.316 rows=7612 loops=1)
>         ->  CTE Scan on rast_select r  (cost=0.00..1240.66 rows=62033
> width=254) (actual time=257.625..863555.082 rows=7677 loops=1)
> Total runtime: 866691.113 ms
>
> On Fri, May 29, 2015 at 6:40 AM, PT <wmoran@potentialtech.com> wrote:
>
> > On Thu, 28 May 2015 10:06:24 -0500
> > David Haynes II <dahaynes@umn.edu> wrote:
> > >
> > > The query run times are significantly slower on outdb as that using indb
> > > here are the run times on 2 queries.
> > >
> > > ST_Count/ST_Clip(select single band here)/Inner Join/ST_Transform (US
> > > Counties)
> > > OutDB: 873.564s (14 minutes 33s)     InDB:  127.36s (2 minutes 7s)
> > >
> > > ST_Count(select single band here)/ST_Clip(on all bands)/Inner
> > > Join/ST_Transform (US Counties)
> > > OutDB: 9537.371s (2 hours 38minutes)   InDB:  310s (5 minutes 10 seconds)
> > >
> > > In the query planner it shows a large change in the number of columns
> > > (width) that are picked up in the CTE_rast_select.
> > > These extra columns slow down the ability to process the data.
> >
> > As I mentioned before, that's not how that works. Those are estimates. If
> > those rows are actually different between the two tables, then your data
> > is different between the two databases and you can't expect the performance
> > to be the same.
> >
> > Additionally, the part you snip out below isn't the part that's different.
> > In particular, the difference is coming from the fact that one of the
> > plans uses modis and the other uses modis_noout.
> >
> > Does modis exist in indb? Does modis_noout exist on outdb? What is the
> > difference between these two tables? Because _that_ is where the time
> > difference is most likely happening (based on the explain output).
> >
> > Additionally, run EXPLAIN ANALYZE on these queries to get the actual
> > times in addition to the estimates. Furthermore, given that a lot of
> > the confusion in this question is due to a lack of iformation, it would
> > be a good idea to include the table definitions.
> >
> > > OUT DB
> > > CTE rast_select
> > > >     ->  Nested Loop  (cost=0.28..76131.41 rows=62033 *width=1086)*
> > > >           ->  CTE Scan on poly s  (cost=0.00..62.18 rows=3109
> > width=250)
> > >
> > >
> > > In DB
> > > Nested Loop  (cost=0.28..51767.41 rows=62033 *width=272*)
> > > >           ->  CTE Scan on poly s  (cost=0.00..62.18 rows=3109
> > width=250)
> > >
> > > On Wed, May 27, 2015 at 4:31 PM, PT <wmoran@potentialtech.com> wrote:
> > >
> > > > On Tue, 26 May 2015 12:52:24 -0500
> > > > David Haynes II <dahaynes@umn.edu> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I have a question about the query optimizer and its performance on
> > > > spatial
> > > > > datasets, specifically rasters. My use case is rather unique, the
> > > > > application that I am developing allows users to request
> > summarizations
> > > > of
> > > > > various geographic boundaries around the world. Therefore our raster
> > > > > datasets are global. We are in the process of conducting some
> > benchmarks
> > > > > for our system and we noticed something unexpected.
> > > > >
> > > > > The query is the same except the first is run on a raster (46gigs)
> > in out
> > > > > of database (outdb) and the second is the same raster (46gigs)
> > stored in
> > > > > database (indb). The raster is multibanded (13), with each band
> > > > > representing one entire MODIS global scene. A single year of MODIS is
> > > > > approximately 3.6 gigs.
> > > > >
> > > > > The outdb is being out performed by indb, because the query optimizer
> > > > gets
> > > > > smarter. But what is also interesting is all the extra pieces that
> > are
> > > > > brought in with outdb.
> > > > >
> > > > > with poly as
> > > > > ( SELECT gid, label as name, ST_Transform(geom, 6842) as geom  FROM
> > > > > us_counties )
> > > > > , rast_select as
> > > > > ( SELECT r.rid as id, s.name , ST_CLIP(r.rast, s.geom) as rast from
> > > > > rasters.glc2000 as r inner join poly as s on ST_Intersects(r.rast,
> > > > s.geom) )
> > > > > select r.id, r.name, ST_Count(r.rast, 1, True)
> > > > >
> > > > >
> > > > >    QUERY PLAN With Outdb
> > > > >
> > > >
> > --------------------------------------------------------------------------------------------------
> > > > > Sort   (cost=93911.29..93926.80 rows=6204 width=254)
> > > > >   Sort Key: r.id, r.name
> > > > >   CTE poly
> > > > >     ->  Seq Scan on us_counties  (cost=0.00..112.86 rows=3109
> > > > width=62247)
> > > > >   CTE rast_select
> > > > >     ->  Nested Loop  (cost=0.28..76131.41 rows=62033 width=1086)
> > > > >           ->  CTE Scan on poly s  (cost=0.00..62.18 rows=3109
> > width=250)
> > > > >           ->  Index Scan using modis_rast_gist on modis r_1
> > > > >  (cost=0.28..24.40 rows=2 width=836)
> > > > >                 Index Cond: ((rast)::geometry && s.geom)
> > > > >                 Filter: _st_intersects(s.geom, rast, NULL::integer)
> > > > >   ->  HashAggregate  (cost=17214.16..17276.20 rows=6204 width=254)
> > > > >         ->  CTE Scan on rast_select r  (cost=0.00..1240.66 rows=62033
> > > > > width=254)
> > > > >
> > > > >                                                 QUERY PLAN With Indb
> > > > >
> > > > >
> > > >
> > -------------------------------------------------------------------------------------------------------------
> > > > > Sort   (cost=69547.29..69562.80 rows=6204 width=254)
> > > > >   Sort Key: r.id, r.name
> > > > >   CTE poly
> > > > >     ->  Seq Scan on us_counties  (cost=0.00..112.86 rows=3109
> > > > width=62247)
> > > > >   CTE rast_select
> > > > >     ->  Nested Loop  (cost=0.28..51767.41 rows=62033 width=272)
> > > > >           ->  CTE Scan on poly s  (cost=0.00..62.18 rows=3109
> > width=250)
> > > > >           ->  Index Scan using modis_noout_rast_gist on modis_noout
> > r_1
> > > > >  (cost=0.28..16.56 rows=2 width=22)
> > > > >                 Index Cond: ((rast)::geometry && s.geom)
> > > > >                 Filter: _st_intersects(s.geom, rast, NULL::integer)
> > > > >   ->  HashAggregate  (cost=17214.16..17276.20 rows=6204 width=254)
> > > > >         ->  CTE Scan on rast_select r  (cost=0.00..1240.66 rows=62033
> > > > > width=254)
> > > >
> > > > I could be missing something here, but I don't see how long the queries
> > > > actually take to run. Have you actually run the queries and timed them?
> > > > Keep in mind that analyze does not actually run the query, it only
> > plans
> > > > it, so the actual run time is unknown if all you do is analyze.
> > > >
> > > > The query plans appear to be equal, assuming there are slight variances
> > > > in the names of tables from one DB to another (and I assume that your
> > > > description of indb and outdb reflects the fact that there are (for
> > > > reasons unknown) two copies of the data).
> > > >
> > > > The only purpose to those estimates is to choose a good plan. If the
> > > > plan is bad for one database and both databases have the same data,
> > then
> > > > the plan will be bad for both.
> > > >
> > > > Since there have been no other responses, I'm guessing that others are
> > > > confused by your question as well. Can you describe the actual problem
> > > > that you're seeing?
> > > >
> > > > --
> > > > Bill Moran <wmoran@potentialtech.com>
> > > >
> > > >
> > > > --
> > > > Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> > > > To make changes to your subscription:
> > > > http://www.postgresql.org/mailpref/pgsql-general
> > > >
> > >
> > >
> > >
> > > --
> > > David Haynes, Ph.D.
> > > Research Associate Terra Populus
> > > Minnesota Population Center
> > > www.terrapop.org
> >
> >
> > --
> > Bill Moran <wmoran@potentialtech.com>
> >
>
>
>
> --
> David Haynes, Ph.D.
> Research Associate Terra Populus
> Minnesota Population Center
> www.terrapop.org


--
Bill Moran <wmoran@potentialtech.com>


pgsql-general by date:

Previous
From: PT
Date:
Subject: Re: Planner cost adjustments
Next
From: Daniel Begin
Date:
Subject: Re: Planner cost adjustments