In this example, we can see that the amount of bytes processed increased significantly between jobs. This is likely the reason for slowness. We can additionally verify this with the fact that total_slot_ms increased, meaning that it had more slots available and still took longer.
1. Clustering: Depending on your query, you may be able to use clustering to help improve the performance. Clustering will help queries that use filtering and aggregation, as clustering sorts similar columns together. This will reduce the amount of data scanned, but will only show large performance improvements for tables greater than a gigabyte.
2. Minimize input data: In order to mitigate this, try to find out if there is any way to optimize the query text to read only the required data. Some options to do this include filtering early, such as adding WHERE statements in the beginning of the query to filter out unnecessary records or modify the SELECT statement to only include the needed columns, rather than a SELECT *.
3. Denormalize your data: if your data involves parent-child or other hierarchical relationships, try to use nested and repeated fields in your schema. This allows BigQuery to parallelize execution and complete faster.
3. Increased Shuffle Memory
While jobs use slots for the compute resources, they also use shuffle memory to keep track of the job’s state to transition data between execution stages as the query progresses. This shared state ultimately allows for parallel processing and optimizations of your query. Your shuffle memory is correlated to the amount of slots available in a reservation.
Because shuffle is an in-memory operation, there is only a finite amount of memory available for each stage of the query. If there is too much data being processed at any point in time, such as a large join, or if there is a high data skew between joins, it’s possible that a stage can become too intensive and exceed its shuffle memory quota. At this point, shuffle bytes will spill to disk, which causes queries to slow down.
To diagnose this, you should look at two different metrics: both the shuffle memory consumed by the job as well as the slots used. Your shuffle memory quota is tied to your slot capacity, so a stable amount of slots alongside an increase in the amount of shuffle spilled to disk would indicate that this could be the root cause.
Compare the aggregate
shuffle_output_bytes_spilled_to_disk from the TIMELINE view. An increase in bytes spilled to disk suggests that the jobs are stuck, rather than running fast enough to complete on time.