Writing a pandas dataframe to cloud storage
In search of simple data sharing
I've often wanted a workflow where I could process data in a Jupyter notebook using pandas, then quickly upload the contents of the dataframe to the cloud to get an easily shareable URL.
This week I finally made that happen using Google Cloud Storage and some associated libraries. The most minimal version of the workflow is now as simple as:
df = pd.DataFrame(...somedata)
df.to_csv('gs://bucket_name/filename.csv')
With the file then available at storage.cloud.google.com/bucket_name/filename.csv
Since I had to get familiar with some new software along the way, I wanted to note down the process step-by-step.
The Google Cloud part
Make a Google Cloud account
Google Cloud is a set of enterprise- and developer-focused cloud services, and so is not enabled for a standard Google account by default.
First I had to visit cloud.google.com and sign up for the service. This took me to the fairly complicated-looking Google Cloud dashboard, where I created a new project named Datahub.
Create a publicly accessible storage bucket
I spent a while looking through all the different options until I found what I was looking for in the 'Storage' button under the Resources tab.
This took me to a less complicated-looking screen where I could create a new cloud storage bucket. At some point I was prompted to add my credit card info, because Google Cloud is a paid product – though it will only cost a few cents per month for the low volume of data transfer that I'll need.
When creating the bucket I deselected the option to enforce public access prevention, meaning that data stored in the bucket will be viewable in read-only mode without authentication.
This would be a risky security practice for any sensitive information, but it's fine for projects where we're sharing non-confidential data in order to collaborate on dataviz work.
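For reference, the same bucket setup can also be done from the command line once the gcloud CLI (installed in the next section) is available. This is only a rough sketch with a hypothetical bucket name and location, rather than the exact console steps described above; granting allUsers the objectViewer role is what makes objects publicly readable:

$ gcloud storage buckets create gs://bucket_name --location=us-central1
$ gcloud storage buckets add-iam-policy-binding gs://bucket_name \
    --member=allUsers --role=roles/storage.objectViewer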
The Python part
Install Google Cloud's CLI and Python library
Back on my local machine, I needed to install the Google Cloud CLI, and then set up Application Default Credentials (ADC) with the command line tool:
$ gcloud init
$ gcloud auth application-default login
The point of ADC is to store a set of credentials that can be accessed by the environment the local code is running in, meaning there's no need to read a specific file containing an application password, API key, etc.
After that, I needed to install the relevant libraries to interface with the cloud storage using Python.
First I installed the google-cloud-storage library into a new Conda environment (which is how I manage data journalism projects):
$ conda create --name cloud
$ conda activate cloud
$ conda install conda-forge::google-cloud-storage
Then I tried a code sample provided by Google to authenticate with a cloud storage bucket using the ADC credentials.
from google.cloud import storage

def authenticate_implicit_with_adc(project_id="your-project-id"):
    # The client picks up the Application Default Credentials automatically
    storage_client = storage.Client(project=project_id)
    buckets = storage_client.list_buckets()
    print("Buckets:")
    for bucket in buckets:
        print(bucket.name)
    print("Listed all storage buckets.")

authenticate_implicit_with_adc(project_id="your-project-id")
This returned the name of my newly created storage bucket. Success!
Next I wanted to figure out the file upload.
Adapting examples from the Python documentation
The Google Cloud team has created a GitHub repo with code samples for interacting with cloud storage using Python.
After working through a few of those, I was able to write a simple, lightweight function to upload a dataframe to the cloud directly from memory (i.e. no need to read/write a local file first).
import pandas as pd
from google.cloud import storage

def upload_csv_to_gcloud(df, filename, project, bucket_name):
    # Convert dataframe to a CSV string in memory
    csv = df.to_csv(index=False)
    # Initialize GCS client and fetch the target bucket
    client = storage.Client(project=project)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    # Upload with the content type specified
    blob.upload_from_string(csv, content_type='text/csv')

df = pd.DataFrame({'letter': ['a', 'b', 'c'],
                   'number': [1, 2, 3]})

upload_csv_to_gcloud(df,
                     'data.csv',
                     'my_cloud_project',
                     'cloud_bucket_name')
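One small extension worth sketching: since the goal is a shareable link, the google-cloud-storage Blob object exposes a public_url property, which points at the unauthenticated storage.googleapis.com address – usable here because the bucket allows public reads. A minimal sketch, reusing the hypothetical names from the example above:

from google.cloud import storage

# Print a shareable link for the object uploaded above
client = storage.Client(project='my_cloud_project')
blob = client.get_bucket('cloud_bucket_name').blob('data.csv')
print(blob.public_url)  # https://storage.googleapis.com/cloud_bucket_name/data.csv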
Bonus: use fs libraries to write directly from pandas
The code above is a flexible and readable way to achieve the goal, as it's clear what the function is doing, and easy to build in more functionality if we need it. But after reading some Stack Overflow answers I found a less explicit but much simpler way to achieve the same result.
First we need to install two more libraries:
- fsspec, so that pandas can read from and write to remote file systems
- gcsfs, as a file system interface specifically for Google Cloud Storage
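Assuming the same Conda/conda-forge setup as before, installing them should look something like:

$ conda install conda-forge::fsspec conda-forge::gcsfs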
Then all we need to do is:
df.to_csv('gs://bucket/filepath.csv')
And we can write our CSV file directly to the cloud!
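As a side note, since fsspec handles reads as well as writes, pulling the file back into a dataframe should be just as simple (same hypothetical path):

df = pd.read_csv('gs://bucket/filepath.csv')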
We can't explicitly specify a file type when writing this way, so the file in the storage bucket is stored as type application/octet-stream rather than CSV or plaintext.
But downloading the file will render it as a conventional CSV, so in some cases this might be an acceptable trade for the sheer simplicity.
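If the content type does matter, one option – sketched here using the explicit client from earlier, with the same hypothetical names – is to patch the object's metadata after writing it:

from google.cloud import storage

# Correct the content type on an object written via df.to_csv('gs://...')
client = storage.Client(project='my_cloud_project')
blob = client.get_bucket('bucket').blob('filepath.csv')
blob.content_type = 'text/csv'
blob.patch()  # push the metadata change to Cloud Storage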
Overall though, once the initial setup is done, it's very quick either way!