Overview
In the study note series, this post covers Google BigQuery. All details are accurate at the time of writing; please refer to Google's documentation for current details.
Notes
- BQ supports window functions, string functions, arrays and structs, user-defined functions, and more.
- External functions can be written in JavaScript, with limitations on size and performance.
- Access control is via views that live in a separate dataset. Access Control Lists (ACLs) are applied to all tables and views in a dataset.
- BigQuery supports a CTE-type concept via the WITH clause (see the sketch after this list).
- ARRAYs and STRUCTs are part of Standard SQL; BigQuery stores data in arrays natively.
- BigQuery can work with both native, internal storage and external files.
- BQ storage is columnar: every column is stored in a separate file.
- A project can have multiple owners.
- Only 6 concurrent queries containing JavaScript UDFs can run at once.
- BQ ML enables users to create and execute machine learning models in BigQuery by using SQL queries
- BQ is built on Dremel, Google's internal query engine.
- Immutable audit logs
- Older data is discounted: long-term storage pricing applies to tables not edited for 90 consecutive days.
- Storage cost is similar to Google Cloud Storage.
- You can ingest JSON
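As a quick sketch of the WITH clause together with ARRAYs and STRUCTs (all names here are made up for illustration):

WITH orders AS (
  -- inline sample data: one order with a repeated STRUCT column
  SELECT 'A-1' AS order_id,
         [STRUCT('widget' AS sku, 2 AS qty),
          STRUCT('gadget' AS sku, 1 AS qty)] AS items
)
SELECT order_id, item.sku, item.qty
FROM orders, UNNEST(items) AS item;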
Compatible Analytic Tools
The following are examples of tools with connectors to BigQuery:
- Looker
- Qlik
- Tableau
- Data Studio
- Excel
From https://cloud.google.com/bigquery-ml/docs/bigqueryml-web-ui-start
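The linked guide walks through creating a model with SQL. A minimal sketch (dataset, table, and column names are hypothetical):

-- Train a logistic regression model directly in SQL
CREATE OR REPLACE MODEL `mydataset.purchase_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  country,
  pageviews,
  IF(transactions > 0, 1, 0) AS label  -- BQML uses a column named `label` by default
FROM `mydataset.sessions`;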
Choosing BigQuery
BQ is a fully managed, petabyte-scale analytics data warehouse. Where it is not a fit, it gives way to:
- BigTable (cheap if it suits)
- Spanner (expensive, many nodes)
Data Types
- INT64
- STRING
- FLOAT64
- BOOL
- ARRAY
- STRUCT
- TIMESTAMP
Nesting
BigQuery allows for nesting of data within the same field, but relies on the use of JSON, stored in a RECORD-type column with REPEATED mode.
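A sketch of querying such a column (hypothetical names; items is a REPEATED RECORD):

SELECT
  order_id,
  items[SAFE_OFFSET(0)].sku AS first_sku,  -- NULL if the array is empty
  ARRAY_LENGTH(items) AS item_count
FROM `myproject.mydataset.orders`;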
User Defined Functions
UDFs can be written in JavaScript or SQL.
- Standard SQL UDFs are scalar
- Legacy SQL UDFs are tabular
For performance, prefer SQL-based UDFs over JavaScript ones (see below).
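A minimal sketch of a temporary scalar UDF in both flavours (function names are made up):

-- SQL UDF
CREATE TEMP FUNCTION CleanName(name STRING) AS (
  TRIM(LOWER(name))
);

-- Equivalent JavaScript UDF; generally slower than the SQL version
CREATE TEMP FUNCTION CleanNameJs(name STRING)
RETURNS STRING
LANGUAGE js AS """
  return name.trim().toLowerCase();
""";

SELECT CleanName('  Alice ') AS sql_udf, CleanNameJs('  Alice ') AS js_udf;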
Performance Techniques
To ensure BQ performs:
- Don’t SELECT *
- Filter early and often
- Put the biggest table first in a join; avoid self-joins
- Low-cardinality GROUP BYs are faster
- Use approximate functions where possible (example after this list)
- Sort at the last step
- Use built-in functionality first, then SQL-based UDFs, and finally JavaScript UDFs
- Partitioning is good but introduces an overhead for users to understand
- Longer prefixes are better for performance when using _TABLE_SUFFIX
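For instance, an approximate count over one of the public NOAA tables referenced later in these notes (exact table name assumed):

-- Much cheaper than COUNT(DISTINCT stn) on large tables
SELECT APPROX_COUNT_DISTINCT(stn) AS approx_stations
FROM `bigquery-public-data.noaa_gsod.gsod1940`;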
Performance Analysis
To analyse the performance of a query, use:
- Per-query explain plans (what did my query do?)
- Stackdriver (what is going on with all my resources in this project?)
Slots
A BigQuery slot is a unit of computational capacity required to execute SQL queries.
Cost
The following should be considered when estimating cost:
Storage:
- Data in table
- Ingest rate
- Discount of older data
Processing:
- On Demand or Flat Rate
- On Demand is based on the amount of data processed
- 1 TB/month free
- Opting in to run high compute queries
Free:
- Loading
- Exporting
- Queries on Metadata
- Cached queries
- Queries with errors
- Tip: use the query validator to get a pricing estimate before running
Quota:
You can manage costs by requesting a custom quota that limits the amount of query data processed per day, at the project level or at the user level.
- Project-level custom quotas limit the total usage of the project.
- User-level custom quotas are applied separately to each user or service account.
Schema Changes
Renaming a column is not supported by the GCP Console, the classic BigQuery web UI, the command-line tool, or the API. You can ADD columns without needing to recreate the table.
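Assuming DDL support for column additions (table and column names hypothetical), adding a column looks like:

-- Adds a column in place; no table recreation needed
ALTER TABLE `myproject.mydataset.orders`
ADD COLUMN IF NOT EXISTS discount_pct FLOAT64;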
Prefixes
The query prefixes #legacySQL and #standardSQL can be used when writing SQL. They:
- Are NOT case-sensitive
- Must precede the query
- Must be separated from the query by a newline character
From https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql
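For example, using a public sample table:

#standardSQL
SELECT word
FROM `bigquery-public-data.samples.shakespeare`
LIMIT 5;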
Standard vs. Legacy SQL
BQ has two types of SQL syntax:
Legacy
SELECT Product_code, Item_Qty, Item_Value
FROM [bigquery-public-data:samples.orders]
ORDER BY order_value DESC
LIMIT 10;
Standard
SELECT Product_code, Item_Qty, Item_Value
FROM bigquery-public-data.samples.orders
ORDER BY order_value DESC
LIMIT 10;
Or
SELECT Product_code, Item_Qty, Item_Value
FROM `bigquery-public-data.samples.orders`
ORDER BY order_value DESC
LIMIT 10;
Other examples with different levels of qualifying:
SELECT * FROM Order;
SELECT * FROM dataset.Order;
SELECT * FROM project.dataset.Order;
From https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
Commands
BQ supports the following DML commands:
- INSERT
- UPDATE
- DELETE
- MERGE
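A sketch of MERGE performing an upsert (all names hypothetical):

-- Merge staged rows into a target table
MERGE `myproject.mydataset.orders` AS t
USING `myproject.mydataset.orders_staging` AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET item_qty = s.item_qty
WHEN NOT MATCHED THEN
  INSERT (order_id, item_qty) VALUES (s.order_id, s.item_qty);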
Wild Cards
BQ allows the use of wild cards when writing SQL. The following are examples in standard SQL:
Use | Example |
---|---|
Suffix | _TABLE_SUFFIX |
Date Range | _TABLE_DATE_RANGE |
Streaming But Not Partitioned | WHERE _PARTITIONTIME IS NULL |
Date Partition | WHERE _PARTITIONTIME = TIMESTAMP('2016-03-28') |
Date Partition | WHERE _PARTITIONDATE = '2016-03-28' |
Multiple Tables in Dataset | FROM `bigquery-public-data.noaa_gsod.*` |
Multiple Tables in Dataset | FROM `bigquery-public-data.noaa_gsod.gsod194*` |
With legacy SQL you need to do something like this:

SELECT name
FROM TABLE_DATE_RANGE([myproject-1234:mydata.people],
                      TIMESTAMP('2014-03-25'),
                      TIMESTAMP('2014-03-27'))
WHERE age >= 35
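The standard SQL equivalent uses a wildcard table with _TABLE_SUFFIX (sharded table naming assumed):

SELECT name
FROM `myproject-1234.mydata.people_*`  -- hypothetical date-sharded tables
WHERE _TABLE_SUFFIX BETWEEN '20140325' AND '20140327'
  AND age >= 35;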
Partitioning
Partitioned tables are possible and perform well, especially when using a date column for partitioning.
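A sketch of creating a date-partitioned table with DDL (names hypothetical):

CREATE TABLE `myproject.mydataset.events` (
  event_ts TIMESTAMP,
  user_id STRING
)
PARTITION BY DATE(event_ts);  -- partitions pruned when queries filter on the date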
Loading
Batch or Stream
BigQuery supports both batch loading and streaming inserts (e.g. from Dataflow). Batching data into BigQuery is free of charge; streaming data, on the other hand, is charged by size.
Schema Auto Detection
Schema auto-detection is available when you load data into BigQuery and when you query an external data source. BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON files.
Encoding
BigQuery supports UTF-8 encoding for both nested or repeated and flat data. BigQuery supports ISO-8859-1 encoding for flat data only for CSV files. By default, the BigQuery service expects all source data to be UTF-8 encoded. Explicitly specify the encoding when you import your data so that BigQuery can properly convert your data to UTF-8 during the import process.
External Data
An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source. Take this approach when:
- Performing ETL operations on the data.
- The data changes frequently.
- The data is ingested periodically.
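Assuming DDL support for external tables (bucket and names hypothetical), such a table can be defined over files in Cloud Storage:

-- The data stays in GCS; only the table definition lives in BigQuery
CREATE EXTERNAL TABLE `mydataset.ext_orders` (
  order_id STRING,
  item_qty INT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/orders/*.csv'],
  skip_leading_rows = 1
);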
Cache
Query results are not cached:
- When a destination table is specified in the job configuration, via the GCP Console, the classic web UI, the command line, or the API
- If any of the referenced tables or logical views have changed since the results were previously cached
- When any of the tables referenced by the query have recently received streaming inserts
- If the query uses non-deterministic functions; for example, date and time functions such as CURRENT_TIMESTAMP()
- If you are querying multiple tables using a wildcard
- If the cached results have expired
- If the query runs against an external data source
If you are using the GCP Console or the classic BigQuery web UI, the result does not contain information about the number of processed bytes, and displays the word “cached”.
Typical cache lifetime is 24 hours, but the cached results are best-effort and may be invalidated sooner
From https://cloud.google.com/bigquery/docs/cached-results
Permissions
Title | Description | Lowest Level |
---|---|---|
Admin | Provides permissions to manage all resources within the project. Can manage all data within the project and can cancel jobs from other users running within the project. | Project |
Data Owner | Read, update, and delete the dataset. Create, update, get, and delete the dataset's tables. When applied at the project or organization level, this role can also create new datasets. | Dataset |
Data Editor | Read the dataset's metadata and list tables in the dataset. Create, update, get, and delete the dataset's tables. | Dataset |
BigQuery User | Provides permissions to run jobs, including queries, within the project. The user role can enumerate their own jobs, cancel their own jobs, and enumerate datasets within a project. Additionally, allows the creation of new datasets within the project; the creator is granted the bigquery.dataOwner role for these new datasets. | Project |
Job User | Enumerate, run, and cancel their own jobs. | Project |
Data Viewer | Read the dataset's metadata and list tables in the dataset. Read data and metadata from the dataset's tables. | Dataset |
Metadata Viewer | List all datasets and read metadata for all datasets in the project. List all tables and views and read metadata for all tables and views in the project. | Project |