Thread: Atomic access to large arrays
Hi
I'm storing historical meteorological gridded data from GFS (http://www.nco.ncep.noaa.gov/pmb/products/gfs/) into an array field in a table like this:
CREATE TABLE grid_f_data_i2 (
  -- Specifies the variable and other features of the data
  id_inventory integer REFERENCES grid_d_inventory(id_inventory),
  -- A new grid has been available every 3 hours for the last 5 years
  dh_date timestamp,
  -- Values are scaled to be stored as signed 2-byte integers
  vl_grid smallint[361][720],
  CONSTRAINT meteo_f_gfs_tmp PRIMARY KEY
    (id_inventory, dh_date)
);
Each value of field vl_grid has dimensions (lat: 361 x lon: 720 = 259,920 cells) for a 0.5-degree grid (about 55 km between points) around the world. So, vl_grid[y][x] stores the value at dh_date of the meteorological variable specified by id_inventory at the geodesic point
latitude = -90 + y*0.5
longitude = x*0.5
The reverse formula for the closest grid point to an arbitrary geodesic point is

y = Round((latitude+90) * 2)
x = Round(longitude*2)
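These two conversions can be sketched in Python (hypothetical helper names; note that Python's round() uses banker's rounding, which only matters for points exactly halfway between grid cells):

```python
def grid_to_geo(y, x):
    """Geodesic point represented by grid cell (y, x)."""
    latitude = -90 + y * 0.5
    longitude = x * 0.5
    return latitude, longitude

def geo_to_grid(latitude, longitude):
    """Closest grid cell to an arbitrary geodesic point."""
    y = round((latitude + 90) * 2)
    x = round(longitude * 2)
    return y, x

# The cell queried later in the thread, vl_grid[123][152], corresponds to
# the point grid_to_geo(123, 152) = (-28.5, 76.0).
```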
Field vl_grid is stored in the TOAST table and compresses well. PostgreSQL is the only database I have found that can store this huge amount of data in only 34 GB of disk. It's a really great system. Queries returning big rectangular areas are very fast, but the target of almost all queries is to get the historical series for a single geodesic point:
SELECT dh_date, vl_grid[123][152]
FROM grid_f_data_i2
WHERE id_inventory = 6
ORDER BY dh_date
In this case, atomic access to just one cell in each of only a few thousand rows becomes too slow.
Please, could somebody answer some of these questions?
- Is it possible to tune some TOAST parameters to get faster atomic access to large arrays?
- Could the "EXTERNAL" storage strategy for TOAST-able columns solve the problem?
- Would atomic access be faster if the table stored vectors of data along the same parallel instead of matrices of global data?
CREATE TABLE grid_f_data_i2 (
  -- Specifies the variable and other features of the data
  id_inventory integer REFERENCES grid_d_inventory(id_inventory),
  -- A new grid has been available every 3 hours for the last 5 years
  dh_date timestamp,
  -- nu_parallel = y = Round((latitude+90) * 2)
  nu_parallel smallint,
  -- Values are scaled to be stored as signed 2-byte integers
  vl_parallel smallint[],
  CONSTRAINT meteo_f_gfs_tmp PRIMARY KEY
    (id_inventory, nu_parallel, dh_date)
);

- Is there another, faster solution?

Thanks in advance and best regards
--
Víctor de Buen Remiro
Tol Development Team member
www.tol-project.org
Victor,

Just wondering, why do you use an array?

Oleg

On Wed, 22 Jul 2009, Victor de Buen (Bayes) wrote:
> I'm storing historical meteorological gridded data from GFS
> (http://www.nco.ncep.noaa.gov/pmb/products/gfs/) into an array field in a
> table like this:
> [...]

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Hi
Using the standard layout, the size increases a lot:
CREATE TABLE grid_f_data_i2 (
  id_inventory integer REFERENCES grid_d_inventory(id_inventory),
  dh_date timestamp,
  lat smallint,
  lon smallint,
  value smallint
);
This table has 4+8+2+2+2 = 18 bytes of column data per row (about 24 bytes per row once alignment padding is counted) and 361 x 720 = 259,920 rows per grid, so about 6,238,080 bytes per grid.
The uncompressed array design uses 4 + 8 + 2*259,920 = 519,852 bytes per row, but just one row per grid.
The TOAST table compresses these arrays by an average factor of about 1/2, so the total size with arrays is about 24 times smaller than with the standard layout.
Now I have more than 60,000 stored grids in 30 GB instead of 720 GB, but I will probably store 1 million grids or more in 0.5 TB instead of 12 TB.
I can't afford to buy or maintain 12 TB of disks.
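The arithmetic above can be double-checked with a short Python script (a sketch; the 24 bytes per row is an assumption that includes alignment padding, and the 1/2 compression ratio is the average reported above):

```python
LAT, LON = 361, 720
cells = LAT * LON                    # 259920 cells per grid

# Standard layout: one row per cell, ~24 bytes each with padding
row_layout = cells * 24              # bytes per grid

# Array layout: integer(4) + timestamp(8) + one smallint per cell
array_layout = 4 + 8 + 2 * cells     # bytes per grid, uncompressed

compressed = array_layout / 2        # average TOAST compression ~1/2
ratio = row_layout / compressed      # how much smaller the array layout is
print(cells, row_layout, array_layout, round(ratio))
```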
--
Víctor de Buen Remiro
Tol Development Team member
www.tol-project.org
"Victor de Buen (Bayes)" <vdebuen@bayesinf.com> writes: > I'm storing historical meteorological gridded data from GFS ( > http://www.nco.ncep.noaa.gov/pmb/products/gfs/) into an array field in a > table like this: > vl_grid smallint[361][720], > - It's posible to tune some TOAST parameters to get faster atomic access > to large arrays? It might save a little bit to make the toast chunk size larger, but I'm not sure you could gain much from that. > - Using "EXTERNAL" strategy for storing TOAST-able columns could solve > the problem? Nope, wouldn't help --- AFAIR array access is not optimized for slice access. In any case, doing that would give up the compression savings that you were so happy about. If your normal access patterns involve "vertical" rather than "horizontal" scans of the data, maybe you should rethink the choice of table layout. Or maybe the compression is enough to allow you to consider storing the data twice, once in the current layout and once in a "vertical" format. regards, tom lane
Thank you very much, Tom
I will try vector 'parallel' and 'vertical' strategies.
Regards
--
Víctor de Buen Remiro
Consultor estadístico
Bayes Forecast
www.bayesforecast.com
Tol Development Team member
www.tol-project.org
On Tue, Jul 21, 2009 at 7:43 PM, Victor de Buen (Bayes) <vdebuen@bayesinf.com> wrote:
> I'm storing historical meteorological gridded data from GFS
> (http://www.nco.ncep.noaa.gov/pmb/products/gfs/) into an array field in a
> table like this:
> [...]
> In this case, atomic access to just one cell in each of only a few
> thousand rows becomes too slow.

That's a side effect of your use of arrays. Arrays are very compact,
and ideal if you always want the whole block of data at once, but
asking for a particular point is the downside of your trade-off.

I would suggest experimenting with smaller grid sizes... maybe
divide your big grid into approximately 16 (4x4) separate subgrids.
This should still 'toast' and give decent compression, but mitigate
the impact of single-point lookup somewhat.

merlin
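Merlin's 4x4 split might look like this in Python (a sketch; the tiling scheme and helper names are my own assumptions, not from the thread):

```python
LAT, LON = 361, 720
SPLITS = 4                      # 4 x 4 = 16 subgrids
SUB_LAT = -(-LAT // SPLITS)     # 91 rows per subgrid (ceiling division; last tile is shorter)
SUB_LON = -(-LON // SPLITS)     # 180 columns per subgrid

def locate(y, x):
    """Subgrid holding global cell (y, x), plus the local indices inside it."""
    return (y // SUB_LAT, x // SUB_LON), (y % SUB_LAT, x % SUB_LON)

# The point vl_grid[123][152] would live in subgrid (1, 0) at local cell
# (32, 152); a historical-series query then decompresses a ~91x180 array
# per row instead of the full 361x720 grid.
```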
Thanks a lot, Merlin.
I will try filling a sample of grids into a new table with different subgrid sizes, in order to find the best trade-off between space and speed.
Regards
--
Víctor de Buen Remiro
Consultor estadístico
Bayes Forecast
www.bayesforecast.com
Tol Development Team member
www.tol-project.org