Overview
In the study note series, this post covers Google BigQuery. All details are accurate at the time of writing; please refer to Google's documentation for current details.
Notes
- BQ supports window functions, string functions, arrays and structs, user-defined functions, and more.
- External functions can be written in JavaScript, with limitations on size and performance.
- Access control is via views that live in a separate dataset. Access Control Lists (ACLs) are applied to all tables and views in a dataset.
- BigQuery supports a CTE-type concept via the WITH clause (see the sketch after this list).
- ARRAYs and STRUCTs are part of Standard SQL; BigQuery stores data in arrays natively.
- BigQuery can work with both native, internal storage and external files.
- BQ storage is columnar: every column is stored in a separate file.
- A project can have multiple owners.
- Only 6 concurrent queries containing JavaScript UDFs can run at once.
- BQ ML enables users to create and execute machine learning models in BigQuery by using SQL queries
- BQ is built on Dremel, Google's internal query engine.
- Immutable audit logs
- Older data is discounted: long-term storage pricing applies to tables not edited for 90 consecutive days.
- Storage cost is similar to Google Cloud Storage.
- You can ingest JSON
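As a quick sketch of the WITH clause together with ARRAYs and STRUCTs (all names here are made up for illustration):

WITH orders AS (
  -- inline sample data: one order with a repeated STRUCT column
  SELECT 'A-1' AS order_id,
         [STRUCT('widget' AS sku, 2 AS qty),
          STRUCT('gadget' AS sku, 1 AS qty)] AS items
)
SELECT order_id, item.sku, item.qty
FROM orders, UNNEST(items) AS item;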
Compatible Analytic Tools
The following are examples of tools with connectors to BigQuery:
- Looker
- Qlik
- Tableau
- Data Studio
- Excel
From https://cloud.google.com/bigquery-ml/docs/bigqueryml-web-ui-start
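The linked guide walks through creating a model with SQL. A minimal sketch (dataset, table, and column names are hypothetical):

-- Train a logistic regression model directly in SQL
CREATE OR REPLACE MODEL `mydataset.purchase_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  country,
  pageviews,
  IF(transactions > 0, 1, 0) AS label  -- BQML uses a column named `label` by default
FROM `mydataset.sessions`;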
Choosing BigQuery
BQ is a fully managed, petabyte-scale analytics data warehouse. Where it is not a fit, it gives way to:
- BigTable (cheap if it suits)
- Spanner (expensive, many nodes)
Data Types
- INT64
- STRING
- FLOAT64
- BOOL
- ARRAY
- STRUCT
- TIMESTAMP
Nesting
BigQuery allows for nesting of data within the same field, but relies on the use of JSON, stored in a RECORD-type column with REPEATED mode.
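A sketch of querying such a column (hypothetical names; items is a REPEATED RECORD):

SELECT
  order_id,
  items[SAFE_OFFSET(0)].sku AS first_sku,  -- NULL if the array is empty
  ARRAY_LENGTH(items) AS item_count
FROM `myproject.mydataset.orders`;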
User Defined Functions
UDFs can be written in JavaScript or SQL.
- Standard SQL UDFs are scalar
- Legacy SQL UDFs are tabular
For performance, prefer SQL-based UDFs over JavaScript ones (see below).
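A minimal sketch of a temporary scalar UDF in both flavours (function names are made up):

-- SQL UDF
CREATE TEMP FUNCTION CleanName(name STRING) AS (
  TRIM(LOWER(name))
);

-- Equivalent JavaScript UDF; generally slower than the SQL version
CREATE TEMP FUNCTION CleanNameJs(name STRING)
RETURNS STRING
LANGUAGE js AS """
  return name.trim().toLowerCase();
""";

SELECT CleanName('  Alice ') AS sql_udf, CleanNameJs('  Alice ') AS js_udf;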
Performance Techniques
To ensure BQ performs:
- Don’t SELECT *
- Filter early and often
- Put the biggest table first in a join; avoid self-joins
- Low-cardinality GROUP BYs are faster
- Use approximate functions where possible (example after this list)
- Sort at the last step
- Use built-in functionality first, then SQL-based UDFs, and finally JavaScript UDFs
- Partitioning is good but introduces an overhead for users to understand
- Longer prefixes are better for performance when using _TABLE_SUFFIX
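For instance, an approximate count over one of the public NOAA tables referenced later in these notes (exact table name assumed):

-- Much cheaper than COUNT(DISTINCT stn) on large tables
SELECT APPROX_COUNT_DISTINCT(stn) AS approx_stations
FROM `bigquery-public-data.noaa_gsod.gsod1940`;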
Performance Analysis
To analyse the performance of a query, use:
- Per-query explain plans (what did my query do?)
- Stackdriver (what is going on with all my resources in this project?)
Slots
A BigQuery slot is a unit of computational capacity required to execute SQL queries.
Cost
The following should be considered when estimating cost:
Storage:
- Data in table
- Ingest rate
- Discount of older data
Processing:
- On Demand or Flat Rate
- On Demand is based on the amount of data processed
- 1 TB/month free
- Opting in to run high compute queries
Free:
- Loading
- Exporting
- Queries on Metadata
- Cached queries
- Queries with errors
- Tip: use the query validator to get a pricing estimate before running
Quota:
You can manage costs by requesting a custom quota that limits the amount of query data processed per day, at the project level or at the user level.
- Project-level custom quotas limit the total usage of the project.
- User-level custom quotas are applied separately to each user or service account.
Schema Changes
Renaming a column is not supported by the GCP Console, the classic BigQuery web UI, the command-line tool, or the API. You can ADD columns without needing to recreate the table.
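Assuming DDL support for column additions (table and column names hypothetical), adding a column looks like:

-- Adds a column in place; no table recreation needed
ALTER TABLE `myproject.mydataset.orders`
ADD COLUMN IF NOT EXISTS discount_pct FLOAT64;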
Prefixes
The query prefixes #legacySQL and #standardSQL can be used when writing SQL. They:
- Are NOT case-sensitive
- Must precede the query
- Must be separated from the query by a newline character
From https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql
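For example, using a public sample table:

#standardSQL
SELECT word
FROM `bigquery-public-data.samples.shakespeare`
LIMIT 5;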
Standard vs. Legacy SQL
BQ has two types of SQL syntax:
Legacy
SELECT Product_code, Item_Qty, Item_Value
FROM [bigquery-public-data:samples.orders]
ORDER BY order_value DESC
LIMIT 10;
Standard
SELECT Product_code, Item_Qty, Item_Value
FROM bigquery-public-data.samples.orders
ORDER BY order_value DESC
LIMIT 10;
Or
SELECT Product_code, Item_Qty, Item_Value
FROM `bigquery-public-data.samples.orders`
ORDER BY order_value DESC
LIMIT 10;
Other examples with different levels of qualifying:
SELECT * FROM Order;
SELECT * FROM dataset.Order;
SELECT * FROM project.dataset.Order;
From https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
Commands
BQ supports the following DML commands:
- INSERT
- UPDATE
- DELETE
- MERGE
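A sketch of MERGE performing an upsert (all names hypothetical):

-- Merge staged rows into a target table
MERGE `myproject.mydataset.orders` AS t
USING `myproject.mydataset.orders_staging` AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET item_qty = s.item_qty
WHEN NOT MATCHED THEN
  INSERT (order_id, item_qty) VALUES (s.order_id, s.item_qty);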
Wild Cards
BQ allows the use of wild cards when writing SQL. The following are examples in standard SQL:
Use | Example |
---|---|
Suffix | _TABLE_SUFFIX |
Date Range | _TABLE_DATE_RANGE |
Streaming But Not Partitioned | WHERE _PARTITIONTIME IS NULL |
Date Partition | WHERE _PARTITIONTIME = TIMESTAMP('2016-03-28') |
Date Partition | WHERE _PARTITIONDATE = '2016-03-28' |
Multiple Tables in Dataset | FROM `bigquery-public-data.noaa_gsod.*` |
Multiple Tables in Dataset | FROM `bigquery-public-data.noaa_gsod.gsod194*` |
With legacy SQL you need to do something like this:

SELECT name
FROM TABLE_DATE_RANGE([myproject-1234:mydata.people],
                      TIMESTAMP('2014-03-25'),
                      TIMESTAMP('2014-03-27'))
WHERE age >= 35
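The standard SQL equivalent uses a wildcard table with _TABLE_SUFFIX (sharded table naming assumed):

SELECT name
FROM `myproject-1234.mydata.people_*`  -- hypothetical date-sharded tables
WHERE _TABLE_SUFFIX BETWEEN '20140325' AND '20140327'
  AND age >= 35;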
Partitioning
Partitioned tables are possible and perform well, especially when using a date column for partitioning.
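A sketch of creating a date-partitioned table with DDL (names hypothetical):

CREATE TABLE `myproject.mydataset.events` (
  event_ts TIMESTAMP,
  user_id STRING
)
PARTITION BY DATE(event_ts);  -- partitions pruned when queries filter on the date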
Loading
Batch or Stream
BigQuery supports both batch loading and streaming inserts (e.g. from Dataflow). Batching data into BigQuery is free of charge; streaming data, on the other hand, is charged by size.
Schema Auto Detection
Schema auto-detection is available when you load data into BigQuery and when you query an external data source. BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON files.
Encoding
BigQuery supports UTF-8 encoding for both nested or repeated and flat data. BigQuery supports ISO-8859-1 encoding for flat data only for CSV files. By default, the BigQuery service expects all source data to be UTF-8 encoded. Explicitly specify the encoding when you import your data so that BigQuery can properly convert your data to UTF-8 during the import process.
External Data
An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source. Take this approach when:
- Performing ETL operations on the data.
- The data changes frequently.
- The data is ingested periodically.
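Assuming DDL support for external tables (bucket and names hypothetical), such a table can be defined over files in Cloud Storage:

-- The data stays in GCS; only the table definition lives in BigQuery
CREATE EXTERNAL TABLE `mydataset.ext_orders` (
  order_id STRING,
  item_qty INT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/orders/*.csv'],
  skip_leading_rows = 1
);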
Cache
Query results are not cached:
- When a destination table is specified in the job configuration, via the GCP Console, the classic web UI, the command line, or the API
- If any of the referenced tables or logical views have changed since the results were previously cached
- When any of the tables referenced by the query have recently received streaming inserts
- If the query uses non-deterministic functions; for example, date and time functions such as CURRENT_TIMESTAMP()
- If you are querying multiple tables using a wildcard
- If the cached results have expired
- If the query runs against an external data source
If you are using the GCP Console or the classic BigQuery web UI, the result does not contain information about the number of processed bytes, and displays the word “cached”.
Typical cache lifetime is 24 hours, but the cached results are best-effort and may be invalidated sooner
From https://cloud.google.com/bigquery/docs/cached-results
Permissions
Title | Description | Lowest Level |
---|---|---|
Admin | Provides permissions to manage all resources within the project. Can manage all data within the project and can cancel jobs from other users running within the project. | Project |
Data Owner | Read, update, and delete the dataset. Create, update, get, and delete the dataset's tables. When applied at the project or organization level, this role can also create new datasets. | Dataset |
Data Editor | Read the dataset's metadata and list tables in the dataset. Create, update, get, and delete the dataset's tables. | Dataset |
BigQuery User | Provides permissions to run jobs, including queries, within the project. The user role can enumerate their own jobs, cancel their own jobs, and enumerate datasets within a project. Additionally, allows the creation of new datasets within the project; the creator is granted the bigquery.dataOwner role for these new datasets. | Project |
Job User | Enumerate, run, and cancel their own jobs. | Project |
Data Viewer | Read the dataset's metadata and list tables in the dataset. Read data and metadata from the dataset's tables. | Dataset |
Metadata Viewer | List all datasets and read metadata for all datasets in the project. List all tables and views and read metadata for all tables and views in the project. | Project |