In Cloud We Persist: Lowering Cloud Deployment Costs with Changes to Persistent Storage

This is the third blog in Cloud Thoughts, a series delving into the details of deploying ThoughtSpot in the Cloud. Read parts one and two for more.

Enterprises in every industry continue to march toward the cloud, chasing the flexibility and agility needed to thrive in today’s market. While they’re definitely finding these benefits, many are also shocked by the cost of running some applications in the cloud.

At ThoughtSpot, we’re dedicated to lowering the cost of infrastructure required to deploy search and AI-driven analytics in the cloud. That’s why we are so excited to announce changes in the persistent storage layer that make use of object storage to help lower the cost of your ThoughtSpot AWS and GCP deployments by as much as 10%. Not only will these changes help you reduce costs; they’ll also improve your ability to scale the ThoughtSpot cluster without needing to make any changes to your cluster configuration.

New ThoughtSpot Deployment Model

What it is.

It’s a means to lower the cost of a ThoughtSpot AWS or GCP deployment by persisting a significant portion of user and application data in object storage (AWS S3 and GCP Google Cloud Storage).

What it’s not.

It’s not a way to run ThoughtSpot queries directly against data stored in your S3 or GCS bucket. User data must still be loaded into memory, with queries being executed on data stored in RAM on the VM instance. The S3 or GCS bucket is controlled by the ThoughtSpot application and serves as the persistence layer that helps retrieve data into memory on the instance (during data loads or if the VM restarts).

As laid out in a prior blog post, ThoughtSpot is deployed in the cloud using a combination of cloud compute (VM) instances and an underlying persistent storage layer that form a cluster of nodes. These nodes power ThoughtSpot's massively parallel in-memory calculation engine to execute thousands of queries and uncover interesting insights on billions of rows of data.

The persistent storage layer is responsible for retrieving all user data and user-generated analytics content into memory (RAM) on the cluster nodes (VM instances in the cloud).

Prior to 5.3, ThoughtSpot’s persistent storage implementation on AWS used Amazon Elastic Block Store (EBS) as the only underlying storage infrastructure, with a Hadoop Distributed File System (HDFS) layer to enforce redundancy across nodes in a cluster. Similarly, the GCP deployment was backed by SSD Persistent Disk before the release of 6.0.

In this deployment model, persistent storage capacity (AWS EBS or GCP SSD Persistent Disk) needs to be provisioned per VM. The size of the persistent storage volumes is based on the amount of user data that can be loaded onto each VM instance.

The tables below show the AWS EBS and GCP SSD Persistent Disk volume sizes that must be provisioned per VM for each supported VM instance type on AWS and GCP (in addition to the boot volume).

AWS VM instance type	User data capacity per VM	EBS data volumes per VM
r4.4xlarge, r5.4xlarge	20GB	2x400GB
r4.8xlarge, r5.8xlarge	100GB	2x400GB
r4.16xlarge, r5.16xlarge	250GB	2x1TB
m5.24xlarge	192GB	2x1TB
r5.24xlarge	384GB	2x1.5TB

GCP VM instance type	CPU/RAM	User data capacity per VM	Zonal SSD Persistent Disk volumes per VM
n1-highmem-16	16/122	20GB	2x400GB
n1-highmem-32	32/208	100GB	2x400GB
n1-standard-96	96/330	180GB	2x1TB
n1-highmem-64	64/416	208GB	2x1TB
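
To make the AWS table above concrete, the short sketch below estimates how much EBS would need to be provisioned for a whole cluster under this model. It assumes, as a simplification this post does not state, that the number of VMs is the total user data divided by the per-VM capacity, rounded up, and that 1TB equals 1000GB. The function and dictionary names are illustrative only and are not part of any ThoughtSpot tooling.

```python
import math

# Per-VM figures copied from the AWS table above:
# instance type -> (user data capacity in GB, EBS volumes per VM, size of each volume in GB).
# Assumption: 1TB is treated as 1000GB.
EBS_ONLY_MODEL = {
    "r4.4xlarge": (20, 2, 400),
    "r4.8xlarge": (100, 2, 400),
    "r4.16xlarge": (250, 2, 1000),
    "m5.24xlarge": (192, 2, 1000),
    "r5.24xlarge": (384, 2, 1500),
}


def estimate_cluster_ebs_gb(instance_type, total_user_data_gb):
    """Return (vm_count, total_ebs_gb) for a cluster holding the given user data.

    Assumption: VM count is total data / per-VM capacity, rounded up.
    Real cluster sizing may involve constraints this sketch ignores.
    """
    capacity_gb, volumes, volume_gb = EBS_ONLY_MODEL[instance_type]
    vm_count = math.ceil(total_user_data_gb / capacity_gb)
    return vm_count, vm_count * volumes * volume_gb


vms, ebs_gb = estimate_cluster_ebs_gb("r4.16xlarge", 1000)
print(f"{vms} VMs, {ebs_gb} GB of EBS data volumes")
```

Under these assumptions, 1000GB of user data on r4.16xlarge instances works out to 4 VMs and 8000GB of provisioned EBS, because block storage grows with every VM added.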

Object storage is cheaper than block storage on all the major cloud providers, including AWS, Azure, and GCP. Using object stores can help lower the infrastructure cost for ThoughtSpot AWS deployments by up to 10%.

Beginning with the 5.3 release for AWS and the 6.0 release for GCP, we support a deployment model that uses AWS S3 and Google Cloud Storage (GCS) for a major portion of persistent storage.

In this deployment model, a major part of the user data and some of the application data is persisted in AWS S3 or Google Cloud Storage, while a reduced (fixed) footprint of AWS EBS or GCP SSD Persistent Disk is used for other application data. Using S3 or GCS for the majority of persistent data reduces the total infrastructure cost of deploying ThoughtSpot. The amount of block storage that must be provisioned per VM goes down significantly for most instance types in this model.

Here’s the deployment footprint with this model.

AWS VM instance type	User data capacity per VM	EBS data volumes per VM
r4.4xlarge, r5.4xlarge	20GB	1x500GB
r4.8xlarge, r5.8xlarge	100GB	1x500GB
r4.16xlarge, r5.16xlarge	250GB	1x500GB
m5.24xlarge	192GB	2x1TB
r5.24xlarge	384GB	2x1.5TB

S3 bucket size (all instance types): approximately equal to the size of user CSV data, plus a small amount of indexing and other application data.

GCP VM instance type	User data capacity per VM	SSD Persistent Disk per VM
n1-highmem-16	20GB	1x500GB
n1-highmem-32	100GB	1x500GB
n1-standard-96	180GB	1x500GB
n1-highmem-64	208GB	1x500GB
n1-highmem-96	312GB	1x500GB

GCS bucket size (all instance types): approximately equal to the size of user CSV data, plus a small amount of indexing and other application data.
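
As a rough illustration of what the two sets of tables imply, the sketch below compares the block storage provisioned per VM under the original model and the object-store-backed model. The figures are copied from the tables in this post for instance types that appear in both, with 1TB treated as 1000GB (an assumption). It measures provisioned block storage only. It is not a cost calculation, since it ignores S3 or GCS usage and actual cloud pricing, and the names used are illustrative.

```python
# instance type -> (block storage per VM before, block storage per VM after), in GB.
# Values copied from the tables in this post; assumption: 1TB = 1000GB.
BLOCK_STORAGE_GB = {
    "r4.4xlarge": (2 * 400, 500),
    "r4.8xlarge": (2 * 400, 500),
    "r4.16xlarge": (2 * 1000, 500),
    "m5.24xlarge": (2 * 1000, 2 * 1000),
    "r5.24xlarge": (2 * 1500, 2 * 1500),
    "n1-highmem-16": (2 * 400, 500),
    "n1-highmem-32": (2 * 400, 500),
    "n1-standard-96": (2 * 1000, 500),
    "n1-highmem-64": (2 * 1000, 500),
}


def reduction_percent(before_gb, after_gb):
    """Percentage drop in provisioned block storage per VM."""
    return (before_gb - after_gb) / before_gb * 100


for instance_type, (before_gb, after_gb) in BLOCK_STORAGE_GB.items():
    pct = reduction_percent(before_gb, after_gb)
    print(f"{instance_type}: {before_gb}GB -> {after_gb}GB ({pct:.1f}% less block storage)")
```

By this measure the smaller instance types drop from 800GB to 500GB of block storage per VM, while the two 24xlarge AWS types keep the same EBS footprint in both tables.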

Using S3 or GCS also separates compute from storage, because persisted data can now scale independently of compute. To learn more about provisioning S3 or GCS as persistent storage, please visit the Cloud section of our documentation.

This is just the beginning as we continue to find ways to help you get more value out of your data in the cloud. We are also working on supporting the object-store-based deployment model on the Azure platform, so stay tuned for that announcement and several other exciting developments.
