Google BigQuery Study Notes

Overview

In the study note series, this post covers Google BigQuery. All details are accurate at the time of writing, please refer to Google for current details.

Notes

  • BQ has all window functions, string, array and struct, user defined functions etc.
  • Can handle external functions in JavaScript. There are limitations in size and performance
  • Access control is via views that live in a separate dataset. Access Control Lists (ACL) are applied to all tables and views in a dataset
  • BigQuery has CTE type concept WITH Clause 
  • ARRAYS and STRUCTS are Standard SQL stores data in arrays natively
  • BigQuery can work with both native, internal storage and external files.
  • In BQ, every column is in a separate file. Columnar
  • Multiple owners in a project
  • You can only run 6 JavaScript UDFs at once
  • BQ ML enables users to create and execute machine learning models in BigQuery by using SQL queries
  • BQ was written by Dremel
  • Immutable audit logs
  • Older data is discounted
  • Storage cost is similar to GCP
  • You can ingest JSON

Compatible Analytic Tools

Following are examples of tools with connectors to BigQuery:

  • Looker
  • QLIK
  • Tableau
  • Data Studio
  • Excel

From https://cloud.google.com/BigQuery-ml/docs/bigqueryml-web-ui-start

Choosing BigQuery

BQ is a fully managed, petabyte scale, analytics data warehouse, it gives way to:

  • BigTable (cheap if it suits)
  • Spanner (expensive, many nodes)

Data Types

  • INT64
  • STRING
  • FLOAT64
  • BOOL
  • ARRAY
  • STRUCT
  • TIMESTAMP

Nesting

BigQuery allows for nesting of data within the same field but relies on the use of JSON, stored in a RECORD type column with a REPEAT mode

User Defined Functions

UDF can be written in Written in JavaScript or SQL.

  • Standard SQL UDFs are scalar
  • Legacy SQL UDFs are tabular

To ensure performance use SQL based UDFs finally JS

Performance Techniques

To ensure BQ performs:

  • Don’t SELECT *
  • Filter early and often
  • Biggest join first, don’t self-join
  • Low cardinality ‘group bys’ are faster
  • Use approx. where possible
  • Sort at the last step
  • Use inbuilt functionality first, then SQL based UDFs finally JS
  • Partitioning is good but introduces an overhead for users to understand
  • Long prefixes are better for performance when using _TABLE_SUFFIX

Performance Analysis

To performance analyse a query use:

  • Per-query explain plans (what did my query do?)
  • Stackdriver (what is going on with all my resources in this project?)

Slots

BigQuery slot is a unit of computational capacity required to execute SQL queries.

Cost

The following should be considered when estimating cost:

Storage:

  • Data in table
  • Ingest rate
  • Discount of older data

Processing:

  • On Demand or Flat Rate
  • On Demand based on the data processed
  • 1 TB/month free
  • Opting in to run high compute queries 

Free:

The following are free:

  • Loading
  • Exporting
  • Queries on Metadata
  • Cached queries
  • Queries with errors
  • Use the query validator with pricing estimates

Quota:

You manage costs by requesting a custom quota that specifies a limit on the amount of query data processed per day. You can control costs at the project-level or at the user level.

  • Project-level custom quotas limit the total project usage.
  • User-level custom quotas are separately applied to each user or service account

Schema Changes  

Renaming a column is not supported by the GCP Console, the classic BigQuery web UI, the command-line tool, or the API. You can ADD columns without needing to recreate the table

Prefixes

The query prefixes #legacySQL and #standardSQL can be used when writing SQL. They:

  • Are NOT case-sensitive
  • Must precede the query
  • Must be separated from the query by a newline character

From https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql

Standard vs. Legacy SQL

BQ has two types of SQL syntax:

Legacy

SELECTProduct_code, Item_Qty, Item_ValueFROM [bigquery-public-data:samples.orders]ORDER BY order_value DESCLIMIT 10;

Standard

SELECTProduct_code, Item_Qty, Item_ValueFROM bigquery-public-data.samples.ordersORDER BY order_value DESCLIMIT 10;
Or

SELECTProduct_code, Item_Qty, Item_ValueFROM bigquery-public-data.samples.ordersORDER BY order_value DESCLIMIT 10;
Other examples with different levels of qualifying:

SELECT * FROM Order;

SELECT * FROM dataset.Order;
SELECT * FROM project.dataset.Order;

From https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax

Commands

BQ has the following commands:

  • INSERT
  • UPDATE
  • DELETE
  • MERGE

Wild Cards

BQ allows the use of wild cards when writing SQL. The following are examples in standard SQL:

Suffix
_TABLE_SUFFIX
Data Range _TABLE_DATE_RANGE
Streaming But Not Partitioned WHERE _PARTITIONTIME IS NULL
Date Partition WHERE _PARTITIONTIME = TIMESTAMP(‘2016-03-28’)) t1
Date Partition WHERE _PARTITIONDATE = ‘2016-03-28’) t1
Multiple Tables in Dataset FROM `bigquery-public-data.noaa_gsod.*`
Multiple Tables in Dataset FROM `bigquery-public-data.noaa_gsod.gsod194*`

With legacy SQL you need to do something like this
SELECT nameFROMTABLE_DATE_RANGE([myproject-1234:mydata.people],TIMESTAMP(‘2014-03-25’), TIMESTAMP(‘2014-03-27’)) WHERE  age >= 35

Partitioning

Partitioned tables are possible and perform well, especially when using a date column for partitioning

Loading

Batch or Stream       

BigQuery supports both batch & streaming data from DataFlow. Batching data to BigQuery is free of charge. Streaming data on the other hand is charged by size.

Schema Auto Detection

Schema auto-detection: Schema auto-detection is available when you load data into BigQuery, and when you query an external data source. BigQuery makes a best-effort attempt to automatically infer the schema for CSV and JSON files.

Encoding

BigQuery supports UTF-8 encoding for both nested or repeated and flat data. BigQuery supports ISO-8859-1 encoding for flat data only for CSV files. By default, the BigQuery service expects all source data to be UTF-8 encoded. Explicitly specify the encoding when you import your data so that BigQuery can properly convert your data to UTF-8 during the import process.

External Data

An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source. Takes this approach when

  • Perform ETL operations on data.
  • Frequently changed data.
  • Data is being ingested periodically.

Cache

Query results are not cached:

  • When a destination table is specified in the job configuration, the GCP Console, the classic web UI, the command line, or the API
  • If any of the referenced tables or logical views have changed since the results were previously cached
  • When any of the tables referenced by the query have recently received streaming inserts
  • If the query uses non-deterministic functions; for example, date and time functions such as CURRENT_TIMESTAMP()
  • If you are querying multiple tables using a wildcard
  • If the cached results have expired
  • If the query runs against an external data source

If you are using the GCP Console or the classic BigQuery web UI, the result does not contain information about the number of processed bytes, and displays the word “cached”.

Typical cache lifetime is 24 hours, but the cached results are best-effort and may be invalidated sooner

From https://cloud.google.com/bigquery/docs/cached-results

Permissions

Title Description Lowest Level
Admin Provides permissions to manage all resources within the project. Can manage all data within the project and can cancel jobs from other users running within the project. Project
Data Owner Read, update, and delete the dataset. Create, update, get, and delete the dataset’s tables. When applied at the project or organization level, this role can also create new datasets. Data Set
Data Editor Read the dataset’s metadata and to list tables in the dataset. Create, update, get, and delete the dataset’s tables. Data Set
BigQuery User Provides permissions to run jobs, including queries, within the project. The user role can enumerate their own jobs, cancel their own jobs, and enumerate datasets within a project. Additionally, allows the creation of new datasets within the project; the creator is granted the bigquery.dataOwner role for these new datasets. Project
Job User Job User role can enumerate their own jobs, Run and cancel their own jobs. Project
Data Viewer Read the dataset’s metadata and to list tables in the dataset. Read data and metadata from the dataset’s tables. Data Set
Metadata Viewer List all datasets and read metadata for all datasets in the project. List all tables and views and read metadata for all tables and views in the project. Project

 From https://cloud.google.com/bigquery/docs/access-control

Leave a Reply

Your email address will not be published. Required fields are marked *