How Estuary helps enterprises harness historical and real-time data pipelines

It is often said that the world’s most valuable resource today is data, given the role it plays in driving all manner of business decisions. But combining data from myriad disparate sources such as SaaS applications to unlock insights is a major undertaking, one that is made all the more difficult when real-time, low-latency data streaming is the name of the game.

This is something that New York-based Estuary is setting out to solve with a “data operations platform” that combines the benefits of “batch” and “stream” processing data pipelines.

“There’s a Cambrian explosion of databases and other data tools which are extremely valuable for businesses but difficult to use,” Estuary cofounder and CEO David Yaffe told VentureBeat. “We help clients get their data out of their current systems and into these cloud-based systems without having to maintain infrastructure, in a way that’s optimized for each of them.”

To help in its mission, Estuary today announced that it has raised $7 million in a seed round of funding led by FirstMark Capital, with participation from a slew of angel investors including Datadog CEO Olivier Pomel and Cockroach Labs CEO Spencer Kimball.

The state of play

Batch data processing, for the uninitiated, describes the concept of integrating data in batches at fixed intervals — this might be useful for processing last week’s sales data to compile a departmental report. Stream data processing, on the other hand, is all about harnessing data in real time as it’s generated — this is more useful if a company wants to generate quick insights on sales as they happen, for example, or where customer support teams need all the recent data about a customer such as their purchases and website interactions.
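
To make the distinction concrete, here is a minimal Python sketch of the two models applied to the same sales events. It is an illustration only, not Estuary's implementation: the record shape and the function names batch_totals and streaming_totals are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape for illustration: (region, sale amount).
Sale = Tuple[str, float]


def batch_totals(sales: List[Sale]) -> Dict[str, float]:
    """Batch model: process the whole accumulated data set in one pass."""
    totals: Dict[str, float] = defaultdict(float)
    for region, amount in sales:
        totals[region] += amount
    return dict(totals)


def streaming_totals(sales: Iterable[Sale]) -> Iterator[Dict[str, float]]:
    """Stream model: update running totals and emit a snapshot per event."""
    totals: Dict[str, float] = defaultdict(float)
    for region, amount in sales:
        totals[region] += amount
        # A fresh result is available as soon as each event arrives.
        yield dict(totals)


if __name__ == "__main__":
    events = [("east", 10.0), ("west", 5.0), ("east", 2.5)]

    # Batch: one answer, available only after the interval closes.
    print(batch_totals(events))

    # Stream: an updated answer after every sale.
    for snapshot in streaming_totals(events):
        print(snapshot)
```

Both functions arrive at the same final totals. The difference is when the answer becomes available: the batch version reports once the full set has been collected, while the streaming version yields an updated view after every event.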

While there has been significant progress in the batch data processing sphere in terms of being able to extract data from SaaS systems with minimal engineering support, the same can’t be said for real-time data. “Engineers who work with lower latency operational systems still have to manage and maintain a massive infrastructure burden,” Yaffe said. “At Estuary, we bring the best of both worlds to data integrations. The simplicity and data retention of batch systems, and the [low] latency of streaming.”

Above: An Estuary conceptualization

Achieving all the above is already possible through existing technologies, of course. If a company wants low-latency data capture, they can use various open source tools such as Pulsar or Kafka to set up and manage their own infrastructure. Or they can use existing vendor-led tools such as HVR, which Fivetran recently acquired, although that is mostly focused on capturing real-time data from databases, with limited support for SaaS applications.

This is where Estuary enters the fray, offering a fully-managed ELT (extract, load, transform) service “that combines both millisecond-latency and point-and-click simplicity,” the company said, bringing open source connectors similar to Airbyte to low-latency use cases.

“We’re creating a new paradigm,” Yaffe said. “So far, there haven’t been products to pull data from SaaS applications in real-time — for the most part, this is a new concept. We are bringing, essentially, a millisecond latency version of Airbyte which works across SaaS, database, pub/sub, and filestores to the market.”

There has been an explosion of activity across the data integration space of late, with dbt Labs raising $150 million to help analysts transform data in the warehouse, while Airbyte closed a $26 million round of funding. Elsewhere, GitLab spun out an open source data integration platform called Meltano. Estuary certainly jibes with all these technologies, but its focus on both batch and stream data processing is where it wants to set itself apart, covering more use cases in the process.

“It’s such a different focus that we don’t see ourselves as competitive with them, but some of the same use cases could be accomplished by either system,” Yaffe said.

The story so far

Yaffe was previously cofounder and CEO at Arbor, a data-focused martech company he sold to LiveRamp in 2016. At Arbor, they created Gazette, the backbone on which Estuary’s managed commercial service Flow, currently in private beta, is built.

Enterprises can use Gazette “as a replacement for Kafka,” according to Yaffe, and it has been entirely open source since 2018. Gazette builds a real-time data lake that stores data as regular files in the cloud and allows users to integrate with other tools. It can be a useful solution on its own, but it still needs considerable engineering resources to use as part of a holistic ELT tool set, which is where Flow comes into play. Companies use Flow to integrate all the systems they use to generate, process, and consume data, unifying the “batch vs streaming paradigms” to ensure that a company’s current and future systems are “synchronized around the same data sets.”

Flow is source-available, meaning that it offers many of the freedoms associated with open source, except its Business Source License (BSL) prevents developers from creating competing products from the source code. On top of that, Estuary licenses a fully-managed version of Flow.

“Gazette is a great solution in comparison to what many companies are doing today, but it still requires talented engineering teams to build and operate applications that will move and process their data — we still think this is too much of a challenge compared to the simpler ergonomics of tooling within the batch space,” Yaffe explained. “Flow takes the concept of streaming which Gazette enables, and makes it as simple as Fivetran for capturing data. The enterprise uses it to get that type of advantage without having to manage infrastructure or be experts in building & operating stream processing pipelines.”

While Estuary doesn’t publish its pricing, Yaffe said that it charges based on the amount of input data that Flow captures and processes each month. In terms of existing customers, Yaffe wasn’t at liberty to divulge any specific names, but he did say that its typical client operates in martech or adtech, while enterprises also use it to migrate data from an on-premises database to the cloud.
