As a global software-as-a-service (SaaS) company specializing in intuitive, AI-powered business solutions designed to enhance customer and employee experiences, Freshworks depends on real-time data to power decision-making and deliver better experiences to its 75,000+ customers. With millions of daily events across products, timely data processing is crucial. To meet this need, Freshworks built a near-real-time ingestion pipeline on Databricks, capable of managing diverse schemas across products and handling millions of events per minute within a 30-minute SLA, while ensuring tenant-level data isolation in a multi-tenant setup.
Achieving this requires a powerful, flexible, and optimized data pipeline, which is exactly what we set out to build.
Freshworks' legacy pipeline was built around Python consumers: each user action triggered events sent in real time from products to Kafka, where the Python consumers transformed and routed them to new Kafka topics. A Rails batching system then converted the transformed data into CSV files stored in AWS S3, and Apache Airflow jobs loaded these batches into the data warehouse. After ingestion, intermediate files were deleted to manage storage. This architecture was well suited for early growth but soon hit its limits as event volume surged.
Rapid growth exposed core challenges. As scale and complexity increased, the fragility and overhead of the old system made clear the need for a unified, scalable, and autonomous data architecture to support the business's growth and analytics needs.
The solution: a foundational redesign centered on Spark Structured Streaming and Delta Lake, purpose-built for near-real-time processing, scalable transformations, and operational simplicity.
We designed a single, streamlined architecture in which Spark Structured Streaming directly consumes from Kafka, transforms the data, and writes it into Delta Lake, all in one job running entirely within Databricks.
This shift has reduced data movement, simplified maintenance and troubleshooting, and accelerated time-to-insight.
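A minimal sketch of this single-job pattern is below. The broker address, topic name, schema, and paths are illustrative assumptions; none of these specifics appear in the source.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("events-ingestion").getOrCreate()

# Hypothetical event schema; real payloads vary per product.
event_schema = StructType([
    StructField("tenant_id", StringType()),
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

# Consume raw events from Kafka (broker and topic are assumed names).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "product-events")
       .load())

# Parse the JSON payload into typed columns.
events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Write straight to Delta; the checkpoint makes the job restartable.
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/product-events")
         .partitionBy("tenant_id")
         .trigger(processingTime="1 minute")
         .start("/mnt/delta/product_events"))
```

Collapsing consume, transform, and write into one streaming job removes the intermediate Kafka topics, CSV files, and orchestration hops of the legacy design.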
The Streaming Component: Spark Structured Streaming
Each incoming event from Kafka passes through a carefully orchestrated series of transformation steps in Spark Structured Streaming, optimized for accuracy, scale, and cost-efficiency; a representative step is sketched below.
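The source does not enumerate the steps themselves, but one step commonly needed in a pipeline like this is event-time deduplication to absorb Kafka replays. This sketch assumes the `events` frame and column names from the earlier example:

```python
# Watermarking bounds the dedup state Spark must keep; dropDuplicates then
# discards replayed records. The 30-minute window mirrors the pipeline's SLA.
deduped = (events
           .withWatermark("event_time", "30 minutes")
           .dropDuplicates(["tenant_id", "event_id"]))
```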
The Storage Component: Lakehouse
Once transformed, the data is written directly to Delta Lake tables using several powerful optimizations; the sketch below illustrates the kinds of settings this typically involves.
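The source doesn't name the specific optimizations used, but on Databricks they commonly include optimized writes, auto-compaction, and Z-ordering. A hedged sketch, with an assumed table name:

```python
# Enable write-time file coalescing and post-write compaction
# (Databricks Delta features); the table name is illustrative.
spark.sql("""
    ALTER TABLE analytics.product_events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

# Periodically cluster files for data skipping on common filter columns.
spark.sql("OPTIMIZE analytics.product_events ZORDER BY (tenant_id, event_time)")
```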
Autoscaling & Adapting in Real Time
Autoscaling is built into the pipeline so the system scales up or down dynamically, handling volume efficiently and controlling cost without impacting performance.
Autoscaling is driven by batch lag and execution time, monitored in real time. Resizing is triggered via job APIs from Spark's StreamingQueryListener (the onQueryProgress callback fired after each batch), ensuring in-flight processing isn't disrupted. This keeps the system responsive, resilient, and efficient without manual intervention.
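A hedged sketch of what that hook can look like in PySpark. The threshold and the resize helper are assumptions for illustration, not the production policy:

```python
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.getOrCreate()

TARGET_BATCH_MS = 60_000  # assumed threshold derived from the SLA


def request_cluster_resize(scale_up: bool) -> None:
    """Hypothetical helper that would call the Databricks cluster resize API."""
    ...


class AutoscaleListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # durationMs["triggerExecution"] is the completed batch's wall-clock time.
        batch_ms = event.progress.durationMs.get("triggerExecution", 0)
        if batch_ms > TARGET_BATCH_MS:
            request_cluster_resize(scale_up=True)
        elif batch_ms < TARGET_BATCH_MS // 4:
            request_cluster_resize(scale_up=False)

    def onQueryTerminated(self, event):
        pass


spark.streams.addListener(AutoscaleListener())
```

Because the callback fires only after a micro-batch completes, a resize request never interrupts a batch mid-flight.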
Built-In Resilience: Handling Failures Gracefully
To maintain data integrity and availability, the architecture includes robust fault tolerance; one pattern commonly paired with this stack is sketched below.
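The concrete mechanisms aren't listed in the source. A typical combination is checkpointed replay plus a dead-letter table for unparseable records, sketched here reusing `raw` and `event_schema` from the first example; the dead-letter path is an assumption:

```python
from pyspark.sql.functions import col, from_json


def process_batch(batch_df, batch_id):
    parsed = batch_df.withColumn(
        "e", from_json(col("value").cast("string"), event_schema)
    )
    # Rows that parse go to the main table...
    (parsed.filter(col("e").isNotNull()).select("e.*")
           .write.format("delta").mode("append")
           .save("/mnt/delta/product_events"))
    # ...while malformed payloads are quarantined for later inspection.
    (parsed.filter(col("e").isNull()).select("value")
           .write.format("delta").mode("append")
           .save("/mnt/delta/dead_letter"))


# The checkpoint stores Kafka offsets, so a restarted job resumes from the
# last committed micro-batch instead of reprocessing from scratch.
(raw.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/mnt/checkpoints/product-events-dlq")
    .start())
```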
Observability and Monitoring at Every Step
A powerful monitoring stack built with Prometheus, Grafana, and Elasticsearch, integrated with Databricks, gives us end-to-end visibility:
Transformation & Batch Execution Metrics:
These metrics let us track transformation health, identify issues, and trigger alerts for quick investigation; one way to export them is sketched below.
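As a sketch of how per-batch metrics can reach Prometheus, the snippet below pushes two gauges from the listener's onQueryProgress payload. The gateway address and metric names are assumptions:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
batch_duration_ms = Gauge(
    "batch_duration_ms", "Micro-batch execution time in ms", registry=registry
)
batch_input_rows = Gauge(
    "batch_input_rows", "Rows consumed by the micro-batch", registry=registry
)


def report_progress(progress):
    """Call with event.progress from onQueryProgress (see the listener above)."""
    batch_duration_ms.set(progress.durationMs.get("triggerExecution", 0))
    batch_input_rows.set(progress.numInputRows)
    push_to_gateway("pushgateway:9091", job="events-ingestion", registry=registry)
```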
From Complexity to Confidence
Perhaps the most transformative shift has been in simplicity.
What once involved five systems and countless integration points is now a single, observable, autoscaling pipeline running entirely within Databricks. We've eliminated brittle dependencies, streamlined operations, and enabled teams to work faster and with greater autonomy. Essentially, fewer moving parts means fewer surprises and more confidence.
By reimagining the data stack around streaming and Delta Lake, we've built a system that not only meets today's scale but is ready for tomorrow's growth.