External Cloud Storage Logs
In our external cloud storage, logs for multiple projects accumulate in the same bucket. To select only the logs related to our project, we created a custom Python script and scheduled it to run daily to perform these tasks:
Authenticate, read and select the data related to our project.
Process the data.
We used the BigQuery stream ingestion API to stream our log data directly into BigQuery. There is also BigQuery Data Transfer Service (DTS) which is a fully managed service to ingest data from Google SaaS apps such as Google Ads, external cloud storage providers such as Amazon S3 and transferring data from data warehouse technologies such as Teradata and Amazon Redshift. DTS automates data movement into BigQuery on a scheduled and managed basis.
Stage 2: Storage in BigQuery
BigQuery organizes data tables into units called datasets. These datasets are scoped to a GCP project. These multiple scopes — project, dataset, and table — help structure information logically. In order to refer to a table from the command line, in SQL queries, or in code, we refer to it by using the following construct: `project.dataset.table`.
BigQuery leverages the columnar storage format and compression algorithm to store data in Colossus, optimized for reading large amounts of structured data. Colossus also handles replication, recovery (when disks crash) and distributed management (so there is no single point of failure). Colossus enables BigQuery users to scale to dozens of petabytes of data stored seamlessly, without paying the penalty of attaching much more expensive compute resources as in traditional data warehouses.
Keeping data in BigQuery is a best practice if you’re looking to optimize both cost and performance. Another best practice is using BigQuery’s table partitioning and clustering features to structure the data to match common data access patterns.
When a table is clustered in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to collocate related data. When new data is added to a table or a specific partition, BigQuery performs automatic re-clustering in the background to restore the sort property of the table or partition. Automatic reclustering is completely free and autonomous for users.
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. You can typically split large tables into many smaller partitions using data ingestion time or TIMESTAMP/DATE column or an INTEGER column. BigQuery supports the following ways of creating partitioned tables :
Ingestion time partitioned tables
DATE/TIMESTAMP column partitioned tables
INTEGER range partitioned tables
We used ingestion time partitioned BigQuery tables as our data storage. Ingestion time partitioned tables are:
Partitioned on the data’s ingestion time or arrival time.
BigQuery automatically loads data into daily, date based partitions reflecting the data’s ingestion or arrival time.
Partition management is key to fully maximizing BigQuery performance and cost when querying over a specific range — it results in scanning less data per query, and pruning is determined before query start time. While partitioning reduces cost and improves performance, it also prevents cost explosion due to users accidentally querying really large tables in entirety.