

Written in collaboration with Shashank Agarwal
In this article we will walk through the steps of migrating data from BigQuery to Memorystore using a Dataproc pipeline: preparing a BigQuery dataset, creating a Memorystore instance, ensuring high availability, creating a Dataproc cluster, and running the job to migrate your data.
Architecture diagram:
Prerequisites:
- A GCP project
- The gcloud CLI installed
First, identify the BigQuery dataset to migrate and take note of its size. For this example I am using the public bigquery-public-data.stackoverflow.post_history dataset, which is 113 GB in size. Redis represents data as key-value pairs, so your BigQuery dataset must have a single primary key column.
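If you want to verify both of these points up front, a quick check with the BigQuery Python client can report the table size and confirm that the key column is unique. This is a minimal sketch using the example table and its id column; substitute your own table and key column.

from google.cloud import bigquery

client = bigquery.Client()

# Report the size of the source table
table = client.get_table("bigquery-public-data.stackoverflow.post_history")
print(f"{table.num_rows} rows, {table.num_bytes / 1e9:.1f} GB")

# Confirm the chosen key column is unique (one Redis key per row)
sql = """
    SELECT COUNT(*) AS total_rows, COUNT(DISTINCT id) AS distinct_keys
    FROM `bigquery-public-data.stackoverflow.post_history`
"""
row = next(iter(client.query(sql).result()))
print(f"total_rows={row.total_rows}, distinct_keys={row.distinct_keys}")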
Create a Memorystore instance larger than your BigQuery dataset. When migrating from BigQuery to Memorystore, the data size can increase due to the additional storage used for keys. In this case a 113 GB dataset in BigQuery resulted in a 130 GB dataset in Memorystore. Below is the gcloud command to create a simple Memorystore instance.
gcloud redis clusters create my-cluster \
--region= \
--network=projects//global/networks/ \
--replica-count=1 \
--node-type=redis-highmem-medium \
--shard-count=15
A redis-highmem-medium node has a capacity of 13 GB per shard. With 15 shards, this Memorystore instance has 195 GB of memory. Learn more about node types here: Cluster and node specification | Memorystore for Redis Cluster | Google Cloud
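To reason about how many shards you need, you can work backwards from the dataset size, an assumed growth factor for key overhead, and the per-shard capacity. The growth factor below (~1.15) is only an assumption derived from the 113 GB to 130 GB observation above; measure your own data before committing to a shard count.

import math

bq_size_gb = 113        # size of the BigQuery dataset
growth_factor = 1.15    # assumed key/encoding overhead (130/113 observed in this example)
shard_capacity_gb = 13  # redis-highmem-medium capacity per shard

required_gb = bq_size_gb * growth_factor
shards = math.ceil(required_gb / shard_capacity_gb)
print(f"~{required_gb:.0f} GB needed -> at least {shards} shards")  # 15 shards (195 GB) leaves headroom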
A prerequisite to creating a Memorystore cluster is having a service connection policy. Make sure the service connection policy is created in the same region as your Memorystore Redis Cluster. Learn more about networking prerequisites here: Networking overview | Memorystore for Redis Cluster | Google Cloud
gcloud network-connectivity service-connection-policies create my-scp \
--network= \
--project= \
--region= \
--service-class=gcp-memorystore-redis \
--subnets=https://www.googleapis.com/compute/v1/projects//regions//subnetworks/
You do not need to create a Memorystore dataset. The dataset will be automatically created via the Dataproc job.
Cross-region replicas ensure high availability in the case of a regional outage. Memorystore allows up to 4 cross-region replicas. Similar to the primary cluster, a service connection policy needs to be created in each region where a cross-region replica exists.
gcloud network-connectivity service-connection-policies create my-scp-2 \
--network= \
--project= \
--region= \
--service-class=gcp-memorystore-redis \
--subnets=https://www.googleapis.com/compute/v1/projects//regions//subnetworks/

gcloud redis clusters create my-cross-region-replica \
--project= \
--region= \
--cross-cluster-replication-role=secondary \
--network=projects//global/networks/ \
--primary-cluster=projects//locations//clusters/ \
--node-type=redis-highmem-medium \
--shard-count=15
Below is an example of a Memorystore configuration with 1 primary in us-east1 and 4 cross-region replicas. Each replica has its own discovery endpoint, which you can use to connect via the redis-cli using redis-cli -h <discovery-endpoint>.
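If you prefer a client library over redis-cli, a quick connectivity check with the redis-py cluster client looks roughly like this; the host value is a placeholder for your cluster's (or replica's) discovery endpoint.

from redis.cluster import RedisCluster

# Placeholder: replace with your cluster's discovery endpoint
client = RedisCluster(host="my-discovery-endpoint-ip", port=6379)
print(client.ping())          # True if the cluster is reachable
print(client.cluster_info())  # cluster_state, known nodes, shard count, etc.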
Now that we have our destination Memorystore Redis Cluster created and configured with cross-region replicas, we can start looking at the migration plan. For this migration we will be using the bigquery_to_memorystore Dataproc template found here: bigquery_to_memorystore.py. This job can be run either in serverless mode or on a cluster. For this example we will be running this job on a Dataproc cluster.
gcloud dataproc clusters create my-redis-cluster \
--region= \
--properties="dataproc:pip.packages=google-cloud-secret-manager==2.19.0"
Follow the README (dataproc-templates/python/dataproc_templates/bigquery/README.md at main) to set the appropriate environment variables and run the script.
Below is the command which can be used to run this job. We have set the time to live (TTL) to 24 hours (86,400 seconds); you can change or remove this setting according to your requirements.
./bin/start.sh \
-- --template=BIGQUERYTOMEMORYSTORE \
--bigquery.memorystore.input.table=bigquery-public-data.stackoverflow.post_history \
--bigquery.memorystore.output.host= \
--bigquery.memorystore.output.port=6379 \
--bigquery.memorystore.output.table=post_history \
--bigquery.memorystore.output.key.column=id \
--bigquery.memorystore.output.model=hash \
--bigquery.memorystore.output.mode=overwrite \
--bigquery.memorystore.output.ttl=86400 \
--bigquery.memorystore.output.dbnum=0
Arguments reference:
Required arguments:
- bigquery.memorystore.input.table: BigQuery input table name (format: project.dataset.table)
- bigquery.memorystore.output.host: Redis Memorystore host
- bigquery.memorystore.output.table: Redis Memorystore target table name
- bigquery.memorystore.output.key.column: Redis Memorystore key column for target table
Optional arguments:
- bigquery.memorystore.output.port: Redis Memorystore port (Defaults to 6379)
- bigquery.memorystore.output.model: Memorystore persistence model for DataFrame (one of: hash, binary) (Defaults to hash)
- bigquery.memorystore.output.mode: Output write mode (one of: append, overwrite, ignore, errorifexists) (Defaults to append)
- bigquery.memorystore.output.ttl: Data time to live in seconds. Data doesn't expire if ttl is less than 1 (Defaults to 0)
- bigquery.memorystore.output.dbnum: Database / namespace for logical key separation (Defaults to 0)
Connect to your Memorystore clusters using redis-cli -h <discovery-endpoint> and query data to confirm the migration. Additionally, ensure all keys were migrated from BigQuery to Memorystore by validating the Total Keys count in Metrics Explorer.
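You can also spot-check the migration programmatically by comparing the BigQuery row count with the Redis key count and reading back a sample record. This sketch assumes the hash model and the table:key naming used by the underlying spark-redis connector (for example post_history:<id>); confirm the key pattern against your own data.

from google.cloud import bigquery
from redis.cluster import RedisCluster

expected = bigquery.Client().get_table(
    "bigquery-public-data.stackoverflow.post_history").num_rows

# Placeholder: replace with your cluster's discovery endpoint
r = RedisCluster(host="my-discovery-endpoint-ip", port=6379)
actual = r.dbsize()  # key count (redis-py aggregates DBSIZE across primaries)
print(f"BigQuery rows: {expected}, Redis keys: {actual}")

# Read back one migrated row, assuming post_history:<id> keys
sample_key = next(r.scan_iter(match="post_history:*", count=1000))
print(r.hgetall(sample_key))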
Monitoring Memorystore
Monitoring the Dataproc job
Migrating from BigQuery to Memorystore is easy to do with the Dataproc template! Be aware of the change in data size, and configure cross-region replicas to ensure high availability.
Source Credit: https://medium.com/google-cloud/migrate-from-bigquery-to-multi-regional-memorystore-using-dataproc-000b6028bf4a?source=rss—-e52cf94d98af—4