Introducing gcloudy

A more intuitive approach to using the Google Cloud Platform (GCP) framework of cloud services in Python
February 28, 2022

Introduction

Google Cloud Platform (GCP) contains a massive framework of interconnected services. While Google provides its own Python package for working within GCP, in practice it is often overly verbose and non-intuitive, especially to new users. To simplify the user experience, I built the gcloudy package, with a focus on the most commonly used GCP services, such as BigQuery, Cloud Functions, etc.

The gcloudy package is not meant to be a replacement for GCP power-users' tooling, but rather an alternative for GCP users who are interested in using Python in GCP to deploy Cloud Functions and interact with certain GCP services, especially BigQuery and Google Cloud Storage. In fact, it is built on top of the canonical Google Python package(s) without any alteration to Google's base code.

Essentially, the idea is to go from this:


from google.cloud import bigquery

gcp_client = bigquery.Client()

my_table_schema = [
    bigquery.SchemaField(name = "string_column", field_type = "STRING", mode = "NULLABLE", ...),
    bigquery.SchemaField(name = "integer_column", field_type = "INT64", mode = "NULLABLE", ...),
    bigquery.SchemaField(name = "float_column", field_type = "FLOAT", mode = "NULLABLE", ...),
    bigquery.SchemaField(name = "date_column", field_type = "DATE", mode = "NULLABLE", ...)
]

job_config = bigquery.LoadJobConfig(
    autodetect = False,
    schema = my_table_schema,
    write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
)

load_job = gcp_client.load_table_from_dataframe(my_df, "my-project-id.my-dataset.my-table", job_config = job_config)

load_job.result()
        

To this:


from gcloudy.GoogleCloud import BigQuery

bq = BigQuery("my-project-id")

bq.write_bigquery(my_df, "my-dataset.my-table")
        

Installing & Initializing

Installation is done via pip:


$ pip install gcloudy
        

Here we initialize a BigQuery instance to work with tables stored in BQ:


from gcloudy.GoogleCloud import BigQuery

bq = BigQuery("my-gcp-project-name")
        

The above block assumes that your code-execution environment (Python notebook, Cloud Function, etc.) has the same Compute Engine IAM permissions as your BigQuery environment, and therefore a specific GCP Client does not have to be defined.

To instead utilize a custom GCP Client, use the gcp_client arg:


from gcloudy.GoogleCloud import BigQuery

bq = BigQuery("my-gcp-project-name", gcp_client = my_gcp_client_object)
        
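
As a minimal sketch, the custom client could be a standard google.cloud.bigquery Client authenticated with an explicit service-account key (the key path below is just a placeholder, and passing a bigquery.Client here is an assumption based on the verbose example above):


from google.cloud import bigquery
from gcloudy.GoogleCloud import BigQuery

# build a BigQuery client authenticated with an explicit service-account key (placeholder path)
my_gcp_client_object = bigquery.Client.from_service_account_json("path/to/service-account-key.json")

# pass the custom client to gcloudy
bq = BigQuery("my-gcp-project-name", gcp_client = my_gcp_client_object)
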

Reading Tables from BigQuery

The most straightforward way to read a table in from BigQuery is by using the .read_bigquery() method. The args are as follows:


bq.read_bigquery

- Read an existing BigQuery table into a DataFrame

read_bigquery(bq_dataset_dot_table = None, date_cols = [], preview_top = None, to_verbose = True, use_polars = False)

bq_dataset_dot_table : the "dataset-name.table-name" path of the existing BigQuery table
date_cols : [optional] column(s) passed inside a list that should be parsed as dates
preview_top : [optional] only read in the top N rows
to_verbose : should info be printed? defaults to True
use_polars : [NEW IN 2.0.0] should a polars DataFrame be returned instead of a pandas DataFrame? Defaults to False
        


As an example, suppose your BigQuery table is located in a dataset called "my-dataset", with the table name "my-table".


my_table = bq.read_bigquery("my-dataset.my-table")
        

The above code reads "my-table" in as a pandas DataFrame.

If you would like to instead read the table in as a polars DataFrame, set the use_polars arg to True:


my_table = bq.read_bigquery("my-dataset.my-table", use_polars = True)
        
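
The optional args can be combined as well. For example, here is a short sketch that parses a "Date" column as dates and previews only the top 100 rows (the "Date" column name is an assumption made for illustration):


# parse "Date" as a date column and only read in the top 100 rows
my_preview = bq.read_bigquery(
    "my-dataset.my-table",
    date_cols = ["Date"],
    preview_top = 100
)
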



Another method for reading tables in from BigQuery is .read_custom_query(). This method allows you to specify custom SQL code. Its args are as follows:


bq.read_custom_query

- Read the result of a custom BigQuery SQL query into a DataFrame

read_custom_query(custom_query, to_verbose = True, use_polars = False)

custom_query : the custom BigQuery SQL query that will produce a table to be read into a DataFrame
to_verbose : should info be printed? defaults to True
use_polars : [NEW IN 2.0.0] should a polars DataFrame be returned instead of a pandas DataFrame? Defaults to False
        


As an example, suppose your BigQuery table, "my-table", contains multiple columns, but you are only interested in reading in three of them ("Date", "Quantity", and "Sales"), and only in observations where "Quantity" is above 50:


my_custom_table = bq.read_custom_query("""
    SELECT
        Date,
        Quantity,
        Sales
    FROM
        `my-gcp-project-name.my-dataset.my-table`
    WHERE
        Quantity > 50
""")
        

Writing tables to BigQuery

Writing to BigQuery is done using the .write_bigquery() method. It contains the following args:


bq.write_bigquery

- Write a DataFrame to a BigQuery table

write_bigquery(df, bq_dataset_dot_table = None, use_schema = None, append_to_existing = False, to_verbose = True)

df : the DataFrame to be written to a BigQuery table
bq_dataset_dot_table : the "dataset-name.table-name" path of the destination BigQuery table
use_schema : [optional] a custom schema for the BigQuery table
append_to_existing : should the DataFrame be appended to an existing BigQuery table? defaults to False
to_verbose : should info be printed? defaults to True
        


As an example, suppose you want to write a new BigQuery table to your "my-dataset" dataset:


bq.write_bigquery(my_data_frame, "my-dataset.my-new-table")
        

By default, the write overwrites an existing table of the same name. To append to the existing table instead, set the append_to_existing arg to True:


bq.write_bigquery(my_data_frame, "my-dataset.my-new-table", append_to_existing = True)
        
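
If you need explicit control over column types, the use_schema arg accepts a custom schema. Here is a hedged sketch that assumes the schema is passed as a list of bigquery.SchemaField objects, as in the verbose example at the top of this article:


from google.cloud import bigquery

# define the column types explicitly rather than relying on autodetection
my_table_schema = [
    bigquery.SchemaField(name = "Date", field_type = "DATE", mode = "NULLABLE"),
    bigquery.SchemaField(name = "Quantity", field_type = "INT64", mode = "NULLABLE"),
    bigquery.SchemaField(name = "Sales", field_type = "FLOAT", mode = "NULLABLE")
]

bq.write_bigquery(my_data_frame, "my-dataset.my-new-table", use_schema = my_table_schema)
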

Executing SQL code remotely in BigQuery

Sometimes your task does not involve reading or writing tables, but rather executing some SQL code remotely. The .send_query() method allows you to do that, with the following args:


bq.send_query

- Send a custom SQL query to BigQuery. The process is carried out within BigQuery; nothing is returned.

send_query(que, to_verbose = True)

que : the custom SQL query to be sent and carried out within BigQuery
to_verbose : should info be printed? defaults to True
        

As an example, suppose you want to create a new table named "my-created-table" by applying the following conditions to your "my-table" table:


bq.send_query("""
    CREATE TABLE `my-gcp-project-name.my-dataset.my-created-table` AS
    (
        SELECT
            Date,
            Quantity,
            Sales
        FROM
            `my-gcp-project-name.my-dataset.my-table`
        WHERE
            Quantity > 50
    )
""")
        

Summary

In this article I have outlined just some of the methods included in my gcloudy library. Because these methods are intuitive and easy to use, they are ideal for chaining together in any ETL task that involves data stored in a BigQuery environment.
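
To illustrate the chaining, here is a rough end-to-end sketch of a small ETL step built from the methods covered above (the column names and the destination table name are assumptions made for the example):


from gcloudy.GoogleCloud import BigQuery

bq = BigQuery("my-gcp-project-name")

# extract: read the source table, parsing "Date" as a date column
sales = bq.read_bigquery("my-dataset.my-table", date_cols = ["Date"])

# transform: keep large orders and add a revenue-per-unit column
large_orders = sales[sales["Quantity"] > 50].copy()
large_orders["UnitPrice"] = large_orders["Sales"] / large_orders["Quantity"]

# load: write the result back to BigQuery
bq.write_bigquery(large_orders, "my-dataset.my-large-orders")
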

Thanks for reading!