Building the perfect cost effective data architecture isn’t easy. With the advent of on-demand systems that only engage when your data is actually being used, costs become harder to predict and can easily spiral out of control. And BigQuery is no exception to this trend. We spoke to a company that uses BigQuery heavily and they shared some of the challenges that they encountered when they tried to move to a near real time data architecture on it. Because the charge model for BigQuery use based on storage scanned, the data access methods and processing cadence used with it are extremely important considerations. We will discuss the top reasons why BigQuery usage costs can grow unexpectedly.
Re-processing Data Too Often
It’s very common to process aggregated data. Existing daily, monthly, and annual data sets can usually be combined and summarized to create meaningful reports. However, rather than performing the combination on the existing data, some analysts chose the less complex path of simply recalculating data across a long historical time window. This method removes the need to handle changing data and perform delta processing. Google benefits from customers that take this approach because their queries scan a lot of data, which triggers a lot of usage charge. Customers that only run queries like this once a month or even once a day typically aren’t hit too hard. But the cost for customers that refresh their data every 30 minutes in this way is 48 times higher than the cost for those that only do it once per day. Can you afford that kind of jump in your bill?
Insufficient Granularity of Partitions
The granularity of partitions isn’t a major factor when you’re dealing with a small amount of data. But, when you’re processing multiple terabytes of data during each query or pipeline cycle, all cost factors become significant. The partition strategy most commonly used is to split data by day and then add clustered columns to handle granularities smaller than a day. But, when you process a large volume of data, a daily granularity is not adequate. In September of this year, Google introduced support for hourly partitions, which is a powerful option to help address this issue. Another option is to store timestamps in UNIX format and use integer based partitioning with them. This approach also allows you to directly control the granularity of your range, allowing you to create partitions that are less than a day or even less than an hour. Smaller ranged partitions are a more efficient and cost effective way to process big data, but you will need to contend with the 4000 partition limitation of BigQuery. As long as you architect your data pipelines accordingly, this limitation is definitely surmountable.
Making Fat Tables and Selecting Too Much Data
BigQuery is a columnar data store. As such, it is subject to many of the performance issues with columnar datastores. However, due to its on-demand scaling of resources, BigQuery doesn’t actually exhibit the issues as slow performance. More hardware is automatically engaged to continue to deliver high performance despite the issues. However, the extra resources used directly translate into higher charges. Many companies fall into this unseen pitfall and only realize their issues when they see their giant bill at the end of the month. The core problem is that tables are being created that are too wide and contain too many blob columns. Queries made against those tables use an excessive amount of CPU cycles and memory. You should try to avoid using SELECT * against any table and keep the number of columns in each table small. Focus your processing on only the columns you need in order to avoid hitting any unexpected costs.
Overall, BigQuery is a solid product that greatly simplifies data warehousing. You don’t have to hire a data engineer to create your architecture, but do avoid the approaches commonly used when interacting with standard relational databases, because they can easily lead to unexpected and uncontrolled costs outside of that context. With the right tools and methods, you can save millions of dollars and still reap the benefits of a highly scalable data architecture with BigQuery.