BigQuery (BQ) Security Controls
BigQuery (BQ) is a fully managed, serverless, scalable data warehouse on GCP that lets you analyze petabytes of data with its built-in query engine.
In my experience and research, most organizations use BigQuery for analytical depth (OLAP), where the goal is to make decisions from very large datasets.
The quick read below gives you an outline of OLAP vs. OLTP.
What is OLTP?
An OLTP system captures and maintains transaction data in a defined format in a database, e.g. bank transactions or point-of-sale (POS) data.
The dataset is normalized, and the focus is on fast processing with low failure rates.
What is OLAP?
OLAP involves writing complex queries to get meaningful results from huge amounts of data, usually historical data from a CRM or similar, or aggregated data from different OLTP databases.
The focus here is on making better sense of large datasets for decision making.
BQ has three main functions to perform:
1. Storage: It offers fully managed storage in which data is laid out by column ("columnar storage"), using Google's proprietary Capacitor format on the Colossus distributed file system.
2. Data Ingestion: How you plan to ingest data into BQ depends on your data pipeline and org requirements.
You might do it via Cloud Storage, Google Sheets, Drive, Pub/Sub, Dataflow, Bigtable, S3, Azure Blob Storage, or file upload (batch load for one-time analysis, e.g. CSV or JSON). You need to define a schema to organize the data in BQ.
3. Data Query Engine: BQ supports querying tables in standard SQL, and you can also interact programmatically with the BigQuery API using various client libraries. BQ uses Dremel to compute query results.
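The schema mentioned in the ingestion step is usually supplied as JSON when batch loading. A minimal sketch, with hypothetical field names (not from any real dataset):

```python
import json

# A minimal BigQuery JSON schema sketch for a hypothetical batch load.
# Field names and types here are illustrative assumptions.
schema = [
    {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "email", "type": "STRING", "mode": "NULLABLE"},    # PII candidate
    {"name": "amount", "type": "NUMERIC", "mode": "NULLABLE"},
    {"name": "created_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
]

# Serialize to a file that a batch load job can reference as its schema.
schema_json = json.dumps(schema, indent=2)
```

Flagging likely-PII columns (like `email` above) at schema time makes the tagging and DLP scanning discussed later much easier.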
Kinds of BQ APIs ~ good to understand when you are on the "hunt" :-)
- BigQuery v2 REST API
- BigQuery Data Policy API
- BigQuery Connections API
- BigQuery Storage API
- BigQuery Data Transfer Service API
- BigQuery Analytics Hub API
- BigQuery Migration API
- BigQuery Reservation API
Data governance for Google BigQuery:
Data governance defines who can take what action, on what data, under what conditions, and in what ways.
In a way, all the steps we take to keep our BQ data as below fall under BQ data governance:
- Secure
- Private
- Accurate
- Available
- Usable
Data governance spans from the onboarding of data to its disposal, so it is important to perform these checks at the very beginning of onboarding new data.
A few initial checks may save your team a lot of time and effort in the future.
- What type of data is it?
- Identify PII or critical datasets, tables, or columns.
- What is the source of the data?
Here we start with the interesting stuff: boring, yet the most important of all.
- Mark PII and Critical Columns/Rows: Data Tagging
Table Tags/Labels, part of the metadata: use labels on your BQ table as key:value pairs, e.g. marking it as PII or critical.
Data Loss Prevention (DLP): It offers a data inspection, classification, and de-identification platform with 150+ predefined detectors, plus the option to create custom detectors using a dictionary or regex logic.
Features:
- Pattern scanning at the BQ table column and row level, e.g. PII such as credit card numbers.
- You may use this service as part of a data pipeline to scan for violations.
- De-identification or data masking: automatically replace the identified pattern in the table.
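To illustrate what a custom detector plus de-identification does, here is a simplified sketch. The real DLP API provides this as a managed service; the regex below is an assumption for illustration, not DLP's actual detector:

```python
import re

# Simplified stand-in for a DLP custom detector: a regex for card-like
# digit runs. Real DLP detectors are far more robust than this.
CREDIT_CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def mask_credit_cards(value, mask_char="#"):
    """Replace any detected card-like pattern with mask characters."""
    return CREDIT_CARD.sub(lambda m: mask_char * len(m.group()), value)

row = "paid with 4111 1111 1111 1111 yesterday"
masked = mask_credit_cards(row)
```

Running a pass like this inside the pipeline, before data lands in an analyst-facing table, keeps raw PII out of BQ entirely.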
Column-level security: There is no direct way of tagging columns in BQ, so you have to use Dataplex (data catalog) to tag a column.
- You first create "taxonomies" (groupings), and then schema-based "policy tags" are created and managed under them.
(Figure: a good example of taxonomies and policy tags.)
Dynamic Data Masking: BigQuery now supports column-level dynamic data masking, so you can selectively mask column data for groups of users while still allowing access to the column. Data masking is built on top of column-level access control, so the two can be used in combination for finer-grained control.
Row-Level Security: It lets you filter data and grant access to specific rows in a table based on qualifying user conditions. You can have multiple row-level and column-level access policies on one table.
(Figure: GCP's representation of row-level security.)
A user with the BigQuery Admin or BigQuery Data Owner role can create row-level access policies on a BigQuery table by specifying the table by name, along with the users or groups (the "grantee list") that can access certain row data.
Ref: Working on BQ table row level policy.
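For reference, the DDL for a row-level access policy looks roughly like the snippet below; the table, policy, group, and column names are all hypothetical:

```python
# Sketch of BigQuery's row-level access policy DDL; every identifier
# below is made up for illustration.
table = "my_project.my_dataset.sales"
row_policy_ddl = f"""
CREATE ROW ACCESS POLICY us_sales_only
ON `{table}`
GRANT TO ("group:us-sales@example.com")
FILTER USING (region = "US");
""".strip()
```

Only rows where the filter expression evaluates to true are visible to the grantees; everyone else simply sees no matching rows.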
Authorized Views: You can write a query and share its results with particular users and groups without giving them access to the source tables. You should combine this with row-level and column-level access policies to avoid information exposure. This lets a user access only certain parts of a table, and that access can be monitored via audit logs. (Use this with caution.)
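A sketch of what such a view might look like: the view lives in a separate dataset that is then authorized on the source dataset, so users query the view without any access to the underlying table. All names below are hypothetical:

```python
# The view exposes only non-sensitive columns of the source table.
authorized_view_sql = """
CREATE VIEW `my_project.shared_views.orders_public` AS
SELECT order_id, order_date, total_amount  -- no email, no address
FROM `my_project.raw_data.orders`;
""".strip()
```

Keeping shared views in their own dataset makes the authorization boundary obvious and easy to audit.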
- Policy Tag Usage:
Access control policies: set them on the basis of column policy tags; you can set up restriction conditions for each policy tag via an IAM policy.
You may want to refer to "Roles used with column-level access control".
- Cloud IAM allows you to manage access via three things:
- Member
- Role
- Policy
It defines "who" can "do what" on "which resource".
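Concretely, that who/what/resource triple shows up as a binding in an IAM policy. A hedged sketch (the role is a real BigQuery role; the group name is made up):

```python
import json

# "Who" (members) can "do what" (role) on "which resource" (whatever
# dataset, table, or project this policy is attached to).
policy = {
    "bindings": [
        {
            "role": "roles/bigquery.dataViewer",        # do what
            "members": ["group:analysts@example.com"],  # who
        }
    ]
}
policy_json = json.dumps(policy, indent=2)
```

Granting to groups rather than individual users keeps the "clearly define user roles" check below manageable as teams change.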
Important Checks:
- Complete IAM audit logs are available by default.
- Clearly define user roles.
- Enable SSO and 2FA.
Policy Violation Alerts: Set up Cloud Monitoring alerts based on policy tags; by default, the usual policy violations also show up in GCP audit logs.
Use audit logging to monitor for policy violations related to authorized views or row-level access policies.
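As a starting point, a Cloud Logging filter for BigQuery access-denied audit entries might look like the sketch below; treat the exact fields as assumptions to verify against your own log entries:

```python
# Filter for BigQuery audit entries with a PERMISSION_DENIED status,
# which often surface row/column policy violations.
log_filter = " AND ".join([
    'resource.type="bigquery_resource"',
    'protoPayload.serviceName="bigquery.googleapis.com"',
    "protoPayload.status.code=7",  # gRPC code 7 = PERMISSION_DENIED
])
```

A filter like this can back a log-based alert so repeated denied queries against tagged columns get noticed quickly.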
3rd Party Integration:
If you have integrated BQ with a third party, it is important to look at the credential mechanism and manage it properly. I have listed some quick checks for this.
Credential types:
- Short-lived: set a 1-hour session timeout.
- Long-lived: these keys do not expire automatically, so if leaked they can lead to unauthorized access. Key rotation is therefore recommended for long-lived credentials.
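A tiny sketch of the rotation check itself; the 90-day window is an assumed policy, not a GCP default:

```python
from datetime import datetime, timedelta, timezone

ROTATION_WINDOW = timedelta(days=90)  # assumed org policy, tune as needed

def needs_rotation(created_at, now=None):
    """Return True when a long-lived key is older than the window."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > ROTATION_WINDOW

old_key_created = datetime.now(timezone.utc) - timedelta(days=120)
fresh_key_created = datetime.now(timezone.utc) - timedelta(days=10)
```

Running a check like this on a schedule against key creation timestamps turns "rotation is suggested" into something enforceable.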
Network Security: Encryption & Traffic Flow
Data at rest: Data in BQ is 100% encrypted at rest, by default with Google-managed encryption keys via KMS, but you may choose customer-managed encryption keys (CMEK) instead.
Column-level encryption: This feature, added in July 2022, connects column-level encryption with Cloud Key Management Service (KMS) to give administrators better control: manage encryption keys in KMS, enable on-access secure key retrieval, and get thorough reporting.
Data in transit: GCP encrypts and authenticates all data in transit when it moves outside the physical boundaries controlled by Google or on Google's behalf.
This defends the data, after a connection is established and authenticated, against potential attackers by:
- Removing the need to trust the lower layers of the network, which are commonly provided by third parties.
- Preventing attackers from accessing data if communications are intercepted.
Traffic Restrictions Using VPC:
Service perimeters mitigate data exfiltration risks by restricting data egress and ingress between resources inside the perimeter and resources outside the perimeter.
We should use VPC Service Controls along with IAM access controls to enable broader context-based perimeter security.
Availability: DDoS? Ah! Naa, no.
BigQuery is a Google Cloud service, so when a user or application sends it a request, the request first reaches a globally distributed system called the Google Front End (GFE).
GFE terminates incoming HTTP(S), TCP, and TLS proxy traffic, and provides DDoS attack countermeasures.
~Going through all the BigQuery documentation and videos was a herculean task; I hope you are able to make use of the steps listed here.
Please feel free to let me know about any incorrect details in the article.