Thread: break table into portions for writing to separate files
Hello, I've been looking for a way to write a table into multiple files, and am wondering if there are some clever suggestions. Say we have a table that is too large (several Gb) to write to a file that can be used for further analyses in other languages. The table consists of a timestamp field and several numeric fields, with records every 10th of a second. It could be meaningfully broken down into subsets of say 20 minutes worth of records. One option is to write a shell script that loops through the timestamp, selects the corresponding subset of the table, and writes it as a unique file. However, this would be extremely slow because each select takes several hours, and there can be hundreds of subsets. Is there a better way? Cheers, -- Seb
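For reference, the looping approach described above might look like the following sketch (bash syntax, assuming a hypothetical database mydb, table samples, and timestamp column tstamp; the boundaries are placeholders). Each iteration re-runs the full query, which is why this approach scales so poorly here.

psql -X -At -d mydb -c \
    "SELECT generate_series('2013-07-28'::timestamp, '2013-09-13', '20 minutes')" |
while read -r t0; do
    # one full (slow) query per 20-minute chunk
    psql -X -d mydb -c \
        "COPY (SELECT * FROM samples
               WHERE tstamp >= '$t0'
                 AND tstamp <  timestamp '$t0' + interval '20 minutes')
         TO STDOUT" > "chunk_${t0//[ :]/-}.tsv"
done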
Hi,
Does the kind of approach below work for you? I haven't tested this, but it should give the idea:

prev_interval := '0'::interval;
LOOP
    IF "From Date" + "Interval" <= "To Date" THEN
        EXECUTE format(
            $$
            COPY (SELECT <column list> FROM <table name>
                  WHERE timestamp_column >= %L AND timestamp_column < %L)
            TO %L
            $$,
            "From Date" + prev_interval,
            "From Date" + "Interval",
            Filename || (extract(epoch FROM "Interval") / 60)::bigint || '.csv');
        prev_interval := "Interval";
        "Interval" := "Interval" + "Interval";
    ELSE
        EXIT;
    END IF;
END LOOP;
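A more concrete version of that loop, as a self-contained sketch only: it assumes a hypothetical table samples with a timestamp column tstamp, the date range mentioned later in the thread, and a role allowed to write server-side files (COPY ... TO <file> writes as the database server process).

DO $$
DECLARE
    t0   timestamp := '2013-07-28 00:00:00';   -- start of range (dates from the thread's view)
    t1   timestamp := '2013-09-13 00:00:00';   -- end of range
    step interval  := '20 minutes';
    t    timestamp := t0;
BEGIN
    WHILE t < t1 LOOP
        -- one server-side CSV file per 20-minute window
        EXECUTE format(
            $q$COPY (SELECT * FROM samples
                     WHERE tstamp >= %L AND tstamp < %L)
               TO %L WITH (FORMAT csv)$q$,
            t, t + step,
            '/tmp/chunk_' || to_char(t, 'YYYYMMDD"_"HH24MI') || '.csv');
        t := t + step;
    END LOOP;
END;
$$;

Note that each COPY still runs its own SELECT, so if the underlying query is the slow part this only automates the chunking; the single-pass approaches discussed later in the thread avoid re-running it.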
On Thu, May 1, 2014 at 11:20 PM, Seb <spluque@gmail.com> wrote:
Hello,
I've been looking for a way to write a table into multiple files, and am
wondering if there are some clever suggestions. Say we have a table
that is too large (several Gb) to write to a file that can be used for
further analyses in other languages. The table consists of a timestamp
field and several numeric fields, with records every 10th of a second.
It could be meaningfully broken down into subsets of say 20 minutes
worth of records. One option is to write a shell script that loops
through the timestamp, selects the corresponding subset of the table,
and writes it as a unique file. However, this would be extremely slow
because each select takes several hours, and there can be hundreds of
subsets. Is there a better way?
Cheers,
--
Seb
Hi:

On Thu, May 1, 2014 at 7:50 PM, Seb <spluque@gmail.com> wrote:
> I've been looking for a way to write a table into multiple files, and am
> wondering if there are some clever suggestions. Say we have a table
> that is too large (several Gb) to write to a file that can be used for
> further analyses in other languages. The table consists of a timestamp
> field and several numeric fields, with records every 10th of a second.
> It could be meaningfully broken down into subsets of say 20 minutes
> worth of records. One option is to write a shell script that loops
> through the timestamp, selects the corresponding subset of the table,
> and writes it as a unique file. However, this would be extremely slow
> because each select takes several hours, and there can be hundreds of
> subsets. Is there a better way?

As you mention looping and a shell, I suppose you are on something
Unix-like, with pipes et al. You can pipe COPY (either with the pipe
options for COPY, or piping a psql command, or whichever you like)
through a script which spits each data chunk into its corresponding
file. If your data is somehow clustered in the table (by chance or by
design) you don't even need to sort the data, just use an open-file
pool. I did that once with call files, chunked them into day-sized
files and it worked like a charm (and if you need the files sorted, you
can then run sort on each of them, which is normally quite fast).

For your description of the data, with a pipe you could read a line,
extract a key for the record (the timestamp rounded down to 20 minutes
would be a good one), get an open output file for append (using a small
caching layer), and write it.

Depending on how many files you expect, how much RAM you have, and how
many files your OS allows you to open, other solutions exist. And if
you do not have enough RAM / open files / clustering for any of them,
there are a multitude of tricks (if, say, you have 3 years' worth, no
correlation, and can only open/buffer about 1000 files, you could split
from the db into day-sized chunks and then split each of those into
20-minute ones).

Regards.
Francisco Olarte.
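A minimal sketch of that pipe-and-split idea, assuming a hypothetical database mydb and source relation big_view whose first COPY column is a timestamp like 2013-07-28 00:00:00.1, and that the number of 20-minute output files stays below the OS open-file limit (awk keeps each one open):

psql -X -d mydb -c "COPY (SELECT * FROM big_view) TO STDOUT" |
awk -F'\t' '{
    d = substr($1, 1, 10)                   # date, e.g. 2013-07-28
    h = substr($1, 12, 2)                   # hour
    m = int(substr($1, 15, 2) / 20) * 20    # minute rounded down to 20
    out = sprintf("chunk_%s_%s%02d.tsv", d, h, m)
    print >> out                            # awk keeps the file open until exit
}'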
Hi,
Several Gb is about 1 GB, and that's not too much. In case you meant
'several GB', that shouldn't be a problem either.

The first thing I'd do would be to create an index on the column used
for dividing the data. Then I'd just use the COPY command with a proper
SELECT to save the data to a file.

If each SELECT lasts for several hours, make the SELECT faster. A good
index usually helps. You can also post here the query which takes too
long, and attach its plan as well.
regards,
Szymon
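A minimal sketch of that index-plus-COPY suggestion, assuming a hypothetical database mydb, a table samples with timestamp column tstamp, and a role allowed to write files on the database server (which is where COPY ... TO <file> writes):

psql -X -d mydb <<'SQL'
-- index on the column used to divide the data
CREATE INDEX samples_tstamp_idx ON samples (tstamp);

-- one 20-minute slice written server-side as CSV
COPY (SELECT *
        FROM samples
       WHERE tstamp >= '2013-07-28 00:00:00'
         AND tstamp <  '2013-07-28 00:20:00')
  TO '/tmp/samples_20130728_0000.csv' WITH (FORMAT csv, HEADER);
SQL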
On 1 May 2014 19:50, Seb <spluque@gmail.com> wrote:
Hello,
I've been looking for a way to write a table into multiple files, and am
wondering if there are some clever suggestions. Say we have a table
that is too large (several Gb) to write to a file that can be used for
further analyses in other languages. The table consists of a timestamp
field and several numeric fields, with records every 10th of a second.
It could be meaningfully broken down into subsets of say 20 minutes
worth of records. One option is to write a shell script that loops
through the timestamp, selects the corresponding subset of the table,
and writes it as a unique file. However, this would be extremely slow
because each select takes several hours, and there can be hundreds of
subsets. Is there a better way?
Cheers,
--
Seb
On Thu, 1 May 2014 20:20:23 +0200,
Francisco Olarte <folarte@peoplecall.com> wrote:

[...]

> As you mention looping and a shell, I suppose you are on something
> Unix-like, with pipes et al. You can pipe COPY (either with the pipe
> options for COPY, or piping a psql command, or whichever you like)
> through a script which spits each data chunk into its corresponding
> file. If your data is somehow clustered in the table (by chance or by
> design) you don't even need to sort the data, just use an open-file
> pool. I did that once with call files, chunked them into day-sized
> files and it worked like a charm (and if you need the files sorted,
> you can then run sort on each of them, which is normally quite fast).

> For your description of the data, with a pipe you could read a line,
> extract a key for the record (the timestamp rounded down to 20 minutes
> would be a good one), get an open output file for append (using a
> small caching layer), and write it.

> Depending on how many files you expect, how much RAM you have, and how
> many files your OS allows you to open, other solutions exist. And if
> you do not have enough RAM / open files / clustering for any of them,
> there are a multitude of tricks (if, say, you have 3 years' worth, no
> correlation, and can only open/buffer about 1000 files, you could
> split from the db into day-sized chunks and then split each of those
> into 20-minute ones).

Thanks, I'm glad to hear you've used this approach successfully. It
seems as though the best solution is to do a single SELECT to get the
data out of the server (it is a view with a very complex query plan
joining several other similar views), and then pipe the output through
say awk to break it down into chunks for writing the files, as you
describe.

Cheers,

--
Seb
On Thu, 1 May 2014 20:22:26 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:

> Hi, several Gb is about 1 GB, and that's not too much. In case you
> meant 'several GB', that shouldn't be a problem either.

Sorry, I meant several GB. Although that may not be a problem for
PostgreSQL, it is for post-processing the output file with other tools.

> The first thing I'd do would be to create an index on the column used
> for dividing the data. Then I'd just use the COPY command with a
> proper SELECT to save the data to a file.

I should have mentioned that this is quite a complex view (not a
table), which joins several other views of similar complexity. I'm not
sure whether indexes are useful/feasible in this case. I'll
investigate.

Thanks,

--
Seb
On 01/05/14 19:50, Seb wrote:
> Hello,
>
> I've been looking for a way to write a table into multiple files, and am
> wondering if there are some clever suggestions. Say we have a table
> that is too large (several Gb) to write to a file that can be used for
> further analyses in other languages. The table consists of a timestamp
> field and several numeric fields, with records every 10th of a second.
> It could be meaningfully broken down into subsets of say 20 minutes
> worth of records. One option is to write a shell script that loops
> through the timestamp, selects the corresponding subset of the table,
> and writes it as a unique file. However, this would be extremely slow
> because each select takes several hours, and there can be hundreds of
> subsets. Is there a better way?

# copy (select * from generate_series(1,1000)) to program 'split -l 100 - /tmp/xxx';
COPY 1000
# \q

$ ls -l /tmp/xxxa*
-rw------- 1 postgres postgres 292 May 1 19:08 /tmp/xxxaa
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxab
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxac
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxad
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxae
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxaf
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxag
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxah
-rw------- 1 postgres postgres 400 May 1 19:08 /tmp/xxxai
-rw------- 1 postgres postgres 401 May 1 19:08 /tmp/xxxaj

Each of those contains 100 lines.

Torsten
On Thu, 01 May 2014 21:12:46 +0200,
Torsten Förtsch <torsten.foertsch@gmx.net> wrote:

[...]

> # copy (select * from generate_series(1,1000)) to program 'split -l 100 - /tmp/xxx';
> COPY 1000
> # \q
[...]
> Each of those contains 100 lines.

Wonderful! I didn't know about the PROGRAM option that COPY has
nowadays. Although the SELECT is slow, the split will happen very
quickly this way, so this should be acceptable.

Thanks,

--
Seb
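Since TO PROGRAM runs the command on the database server, the same mechanism can also split on the timestamp itself rather than on line count. A hedged sketch, assuming a hypothetical table samples whose first column tstamp is the timestamp, awk available to the server's shell, and the postgres OS user allowed to write under /tmp:

COPY (SELECT * FROM samples)
TO PROGRAM $cmd$awk -F'\t' '{
    d = substr($1, 1, 10)
    h = substr($1, 12, 2)
    m = int(substr($1, 15, 2) / 20) * 20
    out = sprintf("/tmp/chunk_%s_%s%02d.tsv", d, h, m)
    print >> out
}'$cmd$;

Here each 20-minute bucket becomes its own file on the server, so the chunks still need to be copied off afterwards.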
On 1 May 2014 21:01, Seb <spluque@gmail.com> wrote:
On Thu, 1 May 2014 20:22:26 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:
> Hi, several Gb is about 1GB, that's not too much. In case you meant
> 'several GB', that shouldn't be a problem as well.
Sorry, I meant several GB. Although that may not be a problem for
PostgreSQL, it is for post-processing the output file with other tools.
> The first thing I'd do would be creating an index on the column used
> for dividing the data. Then I'd just use the command COPY with a
> proper select to save the data to a file.
I should have mentioned that this is quite a complex view (not a table),
which joins several other views of similar complexity. I'm not sure
whether indexes are useful/feasible in this case. I'll investigate.
Yes, indexes can be used to speed up the view as well. Such a view is nothing more than just a query.
regards,
Szymon
On Thu, 1 May 2014 22:17:24 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:

> On 1 May 2014 21:01, Seb <spluque@gmail.com> wrote:
> On Thu, 1 May 2014 20:22:26 +0200,
> Szymon Guz <mabewlun@gmail.com> wrote:
>> Hi, several Gb is about 1GB, that's not too much. In case you meant
>> 'several GB', that shouldn't be a problem as well.
> Sorry, I meant several GB. Although that may not be a problem for
> PostgreSQL, it is for post-processing the output file with other
> tools.
>> The first thing I'd do would be creating an index on the column used
>> for dividing the data. Then I'd just use the command COPY with a
>> proper select to save the data to a file.
> I should have mentioned that this is quite a complex view (not a
> table), which joins several other views of similar complexity. I'm
> not sure whether indexes are useful/feasible in this case. I'll
> investigate.
> Yes, indexes can be used to speed up the view as well. Such a view is
> nothing more than just a query.

Help for CREATE INDEX says that its target should be a table or
materialized view, so I'm guessing you mean indexes on the relevant
fields of the underlying tables. In that case, I already have indexes
on those, especially the timestamp fields which are the ones that are
used for the heavy query work.

Thanks,

--
Seb
On 1 May 2014 22:24, Seb <spluque@gmail.com> wrote:
On Thu, 1 May 2014 22:17:24 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:
> On 1 May 2014 21:01, Seb <spluque@gmail.com> wrote:
> On Thu, 1 May 2014 20:22:26 +0200,
> Szymon Guz <mabewlun@gmail.com> wrote:
>> Hi, several Gb is about 1GB, that's not too much. In case you meant
>> 'several GB', that shouldn't be a problem as well.
> Sorry, I meant several GB. Although that may not be a problem for
> PostgreSQL, it is for post-processing the output file with other
> tools.
>> The first thing I'd do would be creating an index on the column used
>> for dividing the data. Then I'd just use the command COPY with a
>> proper select to save the data to a file.
> I should have mentioned that this is quite a complex view (not a
> table), which joins several other views of similar complexity. I'm
> not sure whether indexes are useful/feasible in this case. I'll
> investigate.
> Yes, indexes can be used to speed up the view as well. Such a view is
> nothing more than just a query.
Help for CREATE INDEX says that its target should be a table or
materialized view, so I'm guessing you mean indexes on the relevant
fields of the underlying tables. In that case, I already have indexes
on those, especially the timestamp fields which are the ones that are
used for the heavy query work.
Can you show us the query plan for the queries you are using, the view definition, and how you query that view?
Szymon
On Thu, 1 May 2014 22:31:46 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:

[...]

> Can you show us the query plan for the queries you are using, the view
> definition, and how you query that view?

Thanks for your help with this. Here's the view definition (eliding
similar column references):

---<--------------------cut here---------------start------------------->---
CREATE OR REPLACE VIEW public.daily_flux_2013 AS
 WITH pre AS (
         SELECT mot.time_study, ...,
            wind.wind_speed_u, ...,
            op.op_analyzer_status, ...,
            count(wind.wind_speed_u) OVER w AS nwind
           FROM daily_motion_2013 mot
             JOIN daily_wind3d_analog_2013 wind USING (time_study)
             JOIN daily_opath_2013 op USING (time_study)
             JOIN ( SELECT generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:20:00'::interval) AS time_20min) ts_20min ON mot.time_study >= ts_20min.time_20min AND mot.time_study < (ts_20min.time_20min + '00:20:00'::interval)
          WINDOW w AS (PARTITION BY ts_20min.time_20min ORDER BY ts_20min.time_20min)
        )
 SELECT pre.time_study, ...,
   FROM pre
  WHERE pre.nwind = 12000
  ORDER BY pre.time_study;
---<--------------------cut here---------------end--------------------->---

Here, mot, wind, and op are views that are similarly constructed
(i.e. they use generate_series () and join a few tables). The WHERE
clause is used to output only 20 minute periods where every 0.1 second
record is available (as determined by the wind_speed_u column).

I'm SELECT'ing this view simply as 'SELECT * FROM daily_flux_2013', and
that shows this query plan (lengthy output from pgadmin's):

---<--------------------cut here---------------start------------------->---
"Sort (cost=29182411.29..29182411.57 rows=111 width=976)"
" Sort Key: pre.time_study"
" CTE pre"
" -> WindowAgg (cost=29181518.64..29181907.52 rows=22222 width=434)"
" -> Sort (cost=29181518.64..29181574.19 rows=22222 width=434)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:20:00'::interval))"
" -> Nested Loop (cost=22171519.20..29179914.24 rows=22222 width=434)"
" Join Filter: (((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) >= (generate_series('2013-07-28 00:00:00'::timestamp without time zone, ' (...)"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Materialize (cost=22171519.20..29175899.74 rows=200 width=426)"
" -> Merge Join (cost=22171519.20..29175898.74 rows=200 width=426)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = (generate_series('2013-07-28 00:00:00'::timestamp without tim (...)"
" -> Merge Join (cost=9360527.55..12865370.87 rows=200 width=123)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = (generate_series('2013-07-28 00:00:00'::timestamp witho (...)"
" -> Unique (cost=8625.16..8626.84 rows=200 width=50)"
" -> Sort (cost=8625.16..8626.00 rows=337 width=50)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval))"
" -> Nested Loop (cost=0.57..8611.01 rows=337 width=50)"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Index Scan using motion_series_time_idx on motion_series mot (cost=0.57..8.59 rows=1 width=50)"
" Index Cond: ("time" = (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)))"
" Filter: (logging_group_id = 33)"
" -> Materialize (cost=9351902.39..12856739.03 rows=200 width=73)"
" -> Unique (cost=9351902.39..12856736.53 rows=200 width=73)"
" -> Merge Join (cost=9351902.39..12369954.70 rows=194712730 width=73)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = w."time")"
" -> Sort (cost=64.84..67.34 rows=1000 width=8)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval))"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Materialize (cost=9351837.55..9546550.28 rows=38942546 width=73)"
" -> Sort (cost=9351837.55..9449193.92 rows=38942546 width=73)"
" Sort Key: w."time""
" -> Append (cost=0.00..2711828.47 rows=38942546 width=73)"
" -> Seq Scan on wind3d_series w (cost=0.00..0.00 rows=1 width=236)"
" Filter: (((stream_type)::text = 'analog'::text) AND (logging_group_id = 33))"
" -> Bitmap Heap Scan on wind3d_series_analog w_1 (cost=728917.29..2711828.47 rows=38942545 width=73)"
" Recheck Cond: (logging_group_id = 33)"
" Filter: ((stream_type)::text = 'analog'::text)"
" -> Bitmap Index Scan on fki_wind3d_series_analog_logging_group_id_fkey (cost=0.00..719181.65 rows=38942545 width=0)"
" Index Cond: (logging_group_id = 33)"
" -> Materialize (cost=12810991.66..16310524.87 rows=200 width=319)"
" -> Unique (cost=12810991.66..16310522.37 rows=200 width=319)"
" -> Merge Join (cost=12810991.66..15824477.13 rows=194418095 width=319)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = op."time")"
" -> Sort (cost=64.84..67.34 rows=1000 width=8)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval))"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Materialize (cost=12810926.82..13005344.91 rows=38883619 width=319)"
" -> Sort (cost=12810926.82..12908135.87 rows=38883619 width=319)"
" Sort Key: op."time""
" -> Append (cost=0.00..2194298.15 rows=38883619 width=319)"
" -> Seq Scan on open_path_series op (cost=0.00..0.00 rows=1 width=556)"
" Filter: ((NOT is_shroud) AND (logging_group_id = 33))"
" -> Index Scan using fki_open_path_series_noshroud_logging_group_id_fkey on open_path_series_noshroud op_1 (cost=0.57..2194298.15 rows=38883618 width=319)"
" Index Cond: (logging_group_id = 33)"
" Filter: (NOT is_shroud)"
" -> CTE Scan on pre (cost=0.00..500.00 rows=111 width=976)"
" Filter: (nwind = 12000)"
---<--------------------cut here---------------end--------------------->---

--
Seb
On 1 May 2014 22:50, Seb <spluque@gmail.com> wrote:
On Thu, 1 May 2014 22:31:46 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:
[...]
> Can you show us the query plan for the queries you are using, the view
> definition, and how you query that view?
Thanks for your help with this. Here's the view definition (eliding
similar column references):
---<--------------------cut here---------------start------------------->---
CREATE OR REPLACE VIEW public.daily_flux_2013 AS
WITH pre AS (
SELECT mot.time_study, ...,
wind.wind_speed_u, ...,
op.op_analyzer_status, ...,
count(wind.wind_speed_u) OVER w AS nwind
FROM daily_motion_2013 mot
JOIN daily_wind3d_analog_2013 wind USING (time_study)
JOIN daily_opath_2013 op USING (time_study)
JOIN ( SELECT generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:20:00'::interval) AS time_20min) ts_20min ON mot.time_study >= ts_20min.time_20min AND mot.time_study < (ts_20min.time_20min + '00:20:00'::interval)
WINDOW w AS (PARTITION BY ts_20min.time_20min ORDER BY ts_20min.time_20min)
)
SELECT pre.time_study, ...,
FROM pre
WHERE pre.nwind = 12000
ORDER BY pre.time_study;
---<--------------------cut here---------------end--------------------->---
Here, mot, wind, and op are views that are similarly constructed
(i.e. they use generate_series () and join a few tables). The WHERE
clause is used to output only 20 minute periods where every 0.1 second
record is available (as determined by the wind_speed_u column).
I'm SELECT'ing this view simply as 'SELECT * FROM daily_flux_2013', and
that shows this query plan (lengthy output from pgadmin's):
---<--------------------cut here---------------start------------------->---
"Sort (cost=29182411.29..29182411.57 rows=111 width=976)"
" Sort Key: pre.time_study"
" CTE pre"
" -> WindowAgg (cost=29181518.64..29181907.52 rows=22222 width=434)"
" -> Sort (cost=29181518.64..29181574.19 rows=22222 width=434)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:20:00'::interval))"
" -> Nested Loop (cost=22171519.20..29179914.24 rows=22222 width=434)"
" Join Filter: (((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) >= (generate_series('2013-07-28 00:00:00'::timestamp without time zone, ' (...)"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Materialize (cost=22171519.20..29175899.74 rows=200 width=426)"
" -> Merge Join (cost=22171519.20..29175898.74 rows=200 width=426)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = (generate_series('2013-07-28 00:00:00'::timestamp without tim (...)"
" -> Merge Join (cost=9360527.55..12865370.87 rows=200 width=123)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = (generate_series('2013-07-28 00:00:00'::timestamp witho (...)"
" -> Unique (cost=8625.16..8626.84 rows=200 width=50)"
" -> Sort (cost=8625.16..8626.00 rows=337 width=50)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval))"
" -> Nested Loop (cost=0.57..8611.01 rows=337 width=50)"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Index Scan using motion_series_time_idx on motion_series mot (cost=0.57..8.59 rows=1 width=50)"
" Index Cond: ("time" = (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)))"
" Filter: (logging_group_id = 33)"
" -> Materialize (cost=9351902.39..12856739.03 rows=200 width=73)"
" -> Unique (cost=9351902.39..12856736.53 rows=200 width=73)"
" -> Merge Join (cost=9351902.39..12369954.70 rows=194712730 width=73)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = w."time")"
" -> Sort (cost=64.84..67.34 rows=1000 width=8)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval))"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Materialize (cost=9351837.55..9546550.28 rows=38942546 width=73)"
" -> Sort (cost=9351837.55..9449193.92 rows=38942546 width=73)"
" Sort Key: w."time""
" -> Append (cost=0.00..2711828.47 rows=38942546 width=73)"
" -> Seq Scan on wind3d_series w (cost=0.00..0.00 rows=1 width=236)"
" Filter: (((stream_type)::text = 'analog'::text) AND (logging_group_id = 33))"
" -> Bitmap Heap Scan on wind3d_series_analog w_1 (cost=728917.29..2711828.47 rows=38942545 width=73)"
" Recheck Cond: (logging_group_id = 33)"
" Filter: ((stream_type)::text = 'analog'::text)"
" -> Bitmap Index Scan on fki_wind3d_series_analog_logging_group_id_fkey (cost=0.00..719181.65 rows=38942545 width=0)"
" Index Cond: (logging_group_id = 33)"
" -> Materialize (cost=12810991.66..16310524.87 rows=200 width=319)"
" -> Unique (cost=12810991.66..16310522.37 rows=200 width=319)"
" -> Merge Join (cost=12810991.66..15824477.13 rows=194418095 width=319)"
" Merge Cond: ((generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval)) = op."time")"
" -> Sort (cost=64.84..67.34 rows=1000 width=8)"
" Sort Key: (generate_series('2013-07-28 00:00:00'::timestamp without time zone, '2013-09-13 00:00:00'::timestamp without time zone, '00:00:00.1'::interval))"
" -> Result (cost=0.00..5.01 rows=1000 width=0)"
" -> Materialize (cost=12810926.82..13005344.91 rows=38883619 width=319)"
" -> Sort (cost=12810926.82..12908135.87 rows=38883619 width=319)"
" Sort Key: op."time""
" -> Append (cost=0.00..2194298.15 rows=38883619 width=319)"
" -> Seq Scan on open_path_series op (cost=0.00..0.00 rows=1 width=556)"
" Filter: ((NOT is_shroud) AND (logging_group_id = 33))"
" -> Index Scan using fki_open_path_series_noshroud_logging_group_id_fkey on open_path_series_noshroud op_1 (cost=0.57..2194298.15 rows=38883618 width=319)"
" Index Cond: (logging_group_id = 33)"
" Filter: (NOT is_shroud)"
" -> CTE Scan on pre (cost=0.00..500.00 rows=111 width=976)"
" Filter: (nwind = 12000)"
---<--------------------cut here---------------end--------------------->---
--
Seb
In this form it is quite unreadable. Could you paste the plan to http://explain.depesz.com/ and provide the URL of the page here?
thanks,
Szymon
On Thu, 1 May 2014 23:41:04 +0200,
Szymon Guz <mabewlun@gmail.com> wrote:

[...]

> In this form it is quite unreadable. Could you paste the plan to
> http://explain.depesz.com/ and provide the URL of the page here?

Nice:

http://explain.depesz.com/s/iMJi

--
Seb
Hi Seb:

On Thu, May 1, 2014 at 8:50 PM, Seb <spluque@gmail.com> wrote:
> Thanks, I'm glad to hear you've used this approach successfully.

Well, this is always successful if you are able to develop a moderately
complex script.

> It seems as though the best solution is to do a single SELECT to get
> the data out of the server (it is a view with a very complex query
> plan joining several other similar views), and then pipe the output
> through say awk to break it down into chunks for writing the files,
> as you describe.

If you've got enough disk in the target machine (several GB, even a
hundred, shouldn't be a problem with today's, or even yesterday's,
machines) for two copies of the data, I'd recommend just doing a COPY
of the view to a file, then processing the file. This lets you use very
simple scripts to chunk it, and normally workstation disk is way
cheaper than server disk.

What I would normally do from what you describe is to spool the whole
table, maybe gzipping it along the way if it's really big; in my
experience some current machines (fast multicores with not-so-fast
disks) are faster gzipping and reading/writing a third of the data
(easily achievable with gzip if your data are numbers and timestamps)
than writing the full set. Then make a set of scripts which work on the
spool. I do not grok awk (I began using perl in the 4.019 era and
substituted it for awk/sed in all kinds of one-liners and the like),
but with perl, on a moderate machine (like the one I'm using, a core
duo with 2 GB, quite old by today's standards) you can filter the data
and extract it to several hundred files at nearly full disk speed. As I
commented before, if your query results exhibit some locality on the
key, you can open several files keeping an LRU cache and split it in
one go. I wouldn't try to force order in the db; I've found disk sorts
are better for this kind of problem.

Ah, another hint. I've had to do somewhat similar tasks (dumping a big
query which joins a lot) in the past. In some of these cases the result
was sweeping and concatenating a slew of really big tables while
joining a lot of small ones, something like sweeping 20 GB while
joining it with 15 more tables totalling 512 MB among them, generating
a 100 GB denormalized result. In those cases I developed a program
which just slurped all the small tables into RAM and did the join
before writing, which greatly improved the speed (as, among other
things, Postgres only had to send me 20.5 GB over the net). Sometimes
you can use things like this to improve performance, as workstation RAM
is often plentiful and cheap, and you have all of it for a single task,
while db server RAM is scarcer.

Francisco Olarte.
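A minimal sketch of that spool-then-split workflow, assuming the daily_flux_2013 view from earlier in the thread, a hypothetical database mydb, a client with gzip and awk, and enough local disk for the compressed spool. Splitting first into day files and then into 20-minute files keeps the number of simultaneously open output files small, as described above:

# 1. Spool the whole view once, compressed, on the client side.
psql -X -d mydb -c "COPY (SELECT * FROM daily_flux_2013) TO STDOUT" |
    gzip > flux_2013.tsv.gz

# 2. First pass: one file per day (only ~48 days' worth of open files).
zcat flux_2013.tsv.gz |
    awk -F'\t' '{ print >> ("day_" substr($1, 1, 10) ".tsv") }'

# 3. Second pass: split each day file into 20-minute chunks.
for f in day_*.tsv; do
    awk -F'\t' '{
        h = substr($1, 12, 2)
        m = int(substr($1, 15, 2) / 20) * 20
        out = sprintf("%s_%s%02d.tsv", FILENAME, h, m)
        print >> out
    }' "$f"
done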