


GCP BigQuery Performance

Background

BigQuery is built on Google's Colossus filesystem and the Dremel massively distributed query engine. BigQuery is a columnar database / datastore.

Queries are split and run over many worker nodes, called slots. The ‘shuffle’, where slots share data, runs over Google's high-performance network.

Loading / Linking Data

Data can be loaded into or linked to BQ. Queries over loaded data are more performant.

Data can be linked from Cloud Storage (CS), currently stored in CSV, Avro, JSON or Datastore backup format. BQ can also link directly to Google Analytics and G Suite (e.g. Sheets).
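Linking (rather than loading) Cloud Storage data can be done with the `bq` CLI by first generating an external table definition. A minimal sketch; the bucket, dataset, table names and schema below are hypothetical:

```shell
# Link (not load) CSV files in Cloud Storage as an external table.

# 1. Generate an external table definition from the source files.
bq mkdef --source_format=CSV \
    "gs://my-bucket/sales/*.csv" \
    date:DATE,region:STRING,amount:FLOAT > sales_def.json

# 2. Create the external (linked) table from that definition.
bq mk --external_table_definition=sales_def.json my_dataset.sales_external
```

Queries against `my_dataset.sales_external` then read the CSV files in place, with the performance caveat noted above.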

Batch Loading Data

Best practice is to copy data to CS and then load it into BQ. The command-line `bq` tool is an efficient load tool.
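The two-step pattern above can be sketched with `gsutil` and `bq`; the bucket, dataset, table names and schema are hypothetical:

```shell
# Typical batch load: stage files in Cloud Storage, then load into BQ.

# 1. Copy the local files to Cloud Storage.
gsutil cp sales_2020*.csv gs://my-bucket/staging/

# 2. Load from Cloud Storage into a native BigQuery table.
bq load --source_format=CSV --skip_leading_rows=1 \
    my_dataset.sales \
    "gs://my-bucket/staging/sales_2020*.csv" \
    date:DATE,region:STRING,amount:FLOAT
```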

Streaming Data

Data can be streamed into BQ from Pub/Sub via Cloud Dataflow. Alternatively the BQ API can be used, e.g. from Python, Go or Scala.
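A minimal Python sketch of the API route, using the `google-cloud-bigquery` client's `insert_rows_json`. The table ID, field names and events are hypothetical; the actual insert needs credentials and the client library, so it is shown commented out:

```python
# Sketch: streaming rows into BigQuery from Python.

def build_rows(events):
    """Convert (user_id, timestamp) tuples into the JSON row dicts
    that insert_rows_json expects."""
    return [{"user_id": u, "ts": t} for u, t in events]

rows = build_rows([
    ("alice", "2020-09-07T12:00:00Z"),
    ("bob", "2020-09-07T12:01:00Z"),
])

# Requires credentials and the google-cloud-bigquery package:
# from google.cloud import bigquery
# client = bigquery.Client()
# errors = client.insert_rows_json("my_project.my_dataset.events", rows)
# assert errors == [], errors  # empty list means all rows were accepted
```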

Query Performance / Cost ($)

Note: +COST means beneficial in terms of cost. +PERF means beneficial in terms of performance.
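For the cost side, a back-of-envelope estimator can be useful. This sketch assumes the historical on-demand rate of $5 per TiB scanned with 1 TiB free per month; check current GCP pricing before relying on these numbers:

```python
# Rough on-demand query cost estimator (assumed rates, not authoritative).
PRICE_PER_TIB = 5.0        # assumed USD per TiB scanned
FREE_TIB_PER_MONTH = 1.0   # assumed free tier

def estimate_query_cost(bytes_scanned, free_tib_remaining=0.0):
    """Estimate the USD cost of a query from its bytes scanned."""
    tib = bytes_scanned / 2**40
    billable = max(tib - free_tib_remaining, 0.0)
    return billable * PRICE_PER_TIB

# A query scanning 2 TiB with no free tier left costs about $10.
print(estimate_query_cost(2 * 2**40))
```

Bytes scanned for a query can be checked before running it with a dry run (e.g. `bq query --dry_run`), which is the usual way to apply +COST thinking up front.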

Updates


This page was generated by GitHub Pages. Page last modified: 20/09/07 12:50