Let the Data Flow! Introducing ThoughtSpot DataFlow

ThoughtSpot’s in-memory data cache, Falcon, revolutionized our customers’ ability to search their data at the speed of thought. Its speed lets users analyze along any dimension without a time penalty, giving them free rein to do any kind of analysis.

Many of our customers leverage their existing ETL solutions, simply adding ThoughtSpot as a target destination for their ETL flows. However, this requires technical expertise, and many of our line-of-business users, such as those in Marketing and Sales, don’t have access to such tools or to technical programmers. To cater to this demand, we are very excited to announce the general availability of DataFlow.

DataFlow is a brand-new capability in ThoughtSpot that lets users easily move data into ThoughtSpot from more than two dozen databases and filesystems. Everything happens through an easy-to-use graphical interface in the browser, and there is no separate server to set up: DataFlow runs on the ThoughtSpot server itself. Check out the quick video below on how DataFlow works:


While traditional ETL tools can be very complex for anyone other than technical data engineers, we have kept DataFlow simple enough that data analysts can use it without any training. Through its point-and-click UI and graded complexity of features, it keeps things simple when requirements are modest, while catering to much more complex requirements when needed.

What makes this even more interesting is the underlying architecture. DataFlow leverages ThoughtSpot’s high-speed tsload bulk-load API, as opposed to JDBC/ODBC. This allows DataFlow to achieve speed at scale when ingesting billions of rows into memory.

DataFlow supports the most common databases, data warehouses, files and applications. The list of supported data sources will continue to grow release over release as we certify additional sources over time.

Here are some of DataFlow’s other features:

  • Load Incremental Data - Source data often spans many years, making it infeasible to reload everything on a daily basis. With DataFlow, you can specify filter conditions to fetch only the latest data.

  • Granular Selection - Rarely do you want to load an entire table or file into ThoughtSpot. DataFlow allows you to select just a subset of the columns.

  • Data Mapping - DataFlow gives you an easy-to-use interface to create new tables in ThoughtSpot based on external tables or files, including specifying their primary keys and sharding keys. DataFlow also lets you load into existing ThoughtSpot tables, with the option to map source columns to different target columns.

  • Sync Scheduling - What is the point of setting up data loads if you can’t schedule them? DataFlow gives you a number of scheduling choices, down to the hourly level. Syncs can also be triggered by other events, such as the presence of a file. You can also have TQL scripts run before or after the data load for tasks like cleaning up the data.

  • TQL Interface - While DataFlow gives you a user-friendly way to create tables, there are times when you want to run complex TQL statements, such as altering sharding keys. For this reason, we have created a “TQL Editor” where you can run such TQL commands. It comes with the added advantage of being secure: you can only modify tables you have access to.

  • Alerts & Monitoring - DataFlow gives you user-friendly ways to monitor data syncs at the cluster level as well as at the individual table level. You can set up alerts for problems that need to be addressed right away, and you can view detailed logs.

  • Pre & Post Handling of Files - With files, you get additional capabilities: loading compressed files, and pulling files from an FTP/SFTP location or from block storage like Amazon S3, Azure Blob Storage, Google Cloud Storage, or even HDFS. Upon completion, you can have those files archived or deleted so they don’t fill up the disks.
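To give a flavor of the TQL Editor, here is a hypothetical sketch of the kind of statement you might run there. The database, table, and column names are made up for illustration, and exact syntax may vary by ThoughtSpot version; consult the TQL reference for your release.

```sql
-- Illustrative TQL: create a fact table with a primary key,
-- sharded (hash-partitioned) on the key column.
CREATE TABLE "retail"."sales" (
    "order_id"   BIGINT,
    "order_date" DATE,
    "amount"     DOUBLE,
    CONSTRAINT PRIMARY KEY ("order_id")
)
PARTITION BY HASH (96) KEY ("order_id");
```

Statements like this are also where the TQL Editor’s security model matters: the command only succeeds if you have access to the tables involved.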

This is just the start of the DataFlow journey. We have many more great features on the DataFlow roadmap, such as loading data into Embrace databases, transforming data before it gets loaded, and better triggers for data loads.

Based on the feedback we have received from our beta customers, we’re very excited about DataFlow and the new use cases it can open up for our customers. For more information, please visit https://thoughtspot.com/dataflow. We look forward to hearing your feedback.