21.6. Sampling Functions #
-
TABLESAMPLE (sampling_method(percentage | rows))# Samples a subset of rows from an analytical table or query result. This is useful for analyzing large datasets by working with representative samples, improving query performance for exploratory data analysis.
Example 21.20.
-- Sample 10% of rows from a table SELECT * FROM large_table TABLESAMPLE SYSTEM(10); -- Sample approximately 1000 rows SELECT * FROM events TABLESAMPLE SYSTEM(1000 ROWS); -- Sample from data lake files SELECT * FROM read_parquet('s3://datalake/**/*.parquet') TABLESAMPLE SYSTEM(5); -- Use sampling for quick data profiling SELECT region, COUNT(*) as sample_count, AVG(revenue) as avg_revenue FROM sales_data TABLESAMPLE SYSTEM(2) GROUP BY region; -- Sample from joins for performance SELECT c.name, COUNT(o.id) as order_count FROM customers c JOIN orders o TABLESAMPLE SYSTEM(10) ON c.id = o.customer_id GROUP BY c.name;Sampling methods:
SYSTEM: Random sampling at the storage level (faster, approximate percentage).BERNOULLI: Row-by-row random sampling (slower, exact percentage).
Example 21.21.
-- System sampling (recommended for large tables) SELECT * FROM huge_table TABLESAMPLE SYSTEM(1); -- Bernoulli sampling (exact percentage) SELECT * FROM medium_table TABLESAMPLE BERNOULLI(5);
Use cases:
Data exploration: Quick analysis of large datasets.
Performance testing: Test queries on sample data.
Data profiling: Understand data distribution patterns.
ETL operations development: Develop pipelines on sample data.
Quality checks: Validate data quality on samples.
Required parameters:
Name
Type
Description
sampling_methodkeywordSYSTEMorBERNOULLIpercentagenumericPercentage of rows to sample (0-100)
Optional parameters:
Name
Type
Description
rowsintegerApproximate number of rows to sample (use with ROWS keyword)