Fri, 31 Jan 2025, 05:45 PM – 08:30 PM IST
Sandeep Joshi
@sand777 Curator
Submitted Feb 4, 2025
Achal Shah, Tech Lead Manager at Tecton, discussed the evolving landscape of data processing.
Traditionally, distributed processing engines like Spark and Storm were essential for handling large datasets, but advancements in computing power have changed this. Many companies now prefer alternatives like Ray, Snowflake, Materialize, and DuckDB¹, questioning the necessity of distributed systems when high-performance single-node solutions can suffice.
Achal highlighted a shift in data architecture trends, where the lakehouse model—storing raw data and transforming it as needed—has become the standard. Open table formats and messaging systems are now widely adopted, even among companies migrating from legacy systems like HDFS. The once-debated distinction between data lakes and data warehouses is fading, as companies, including Snowflake, now support formats like Iceberg.
He also discussed the evolution of Spark programming, highlighting how Python has replaced Scala as the primary language for data engineering, making it easier to define data sources and sinks for batch and stream jobs. Python has become the de facto API for data tools, often acting as a thin wrapper over high-performance native code, and Rust-based tools like uv are improving the Python ecosystem itself.
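As a rough illustration of that Python-first workflow (not code from the talk), here is a minimal PySpark sketch covering a batch source/sink and a Kafka streaming sink; all bucket paths, broker addresses, and topic names are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Batch: read Parquet from object storage, aggregate, write back.
events = spark.read.parquet("s3a://example-bucket/raw/events/")
daily = events.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("s3a://example-bucket/agg/daily/")

# Streaming: consume a Kafka topic and append to a Parquet sink.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)
query = (
    stream.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream.format("parquet")
    .option("path", "s3a://example-bucket/stream/events/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
    .start()
)
query.awaitTermination()
```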
Achal discussed modern data infrastructure trends, emphasizing the shift from Hadoop-based HDFS deployments to object storage such as S3 combined with Delta, Iceberg, and Hudi table formats. He highlighted Unity Catalog as a dominant solution in the Databricks ecosystem, while open alternatives like OpenLineage exist but see limited traction.
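For context, a minimal sketch of the pattern he described, an Iceberg table on S3 queried through Spark SQL, might look like the following; it assumes the Iceberg Spark runtime jar is on the classpath, and the catalog name, bucket, and table are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-example")
    # Register an Iceberg catalog backed by a warehouse on S3.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

# Iceberg tables are created and queried through plain Spark SQL.
spark.sql(
    "CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg"
)
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM demo.db.events").show()
```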
Regarding databases, Achal described DuckDB as a high-performance, in-process analytical SQL database that runs in memory by default but can now spill to disk when handling datasets larger than RAM. He touched on efficiency challenges related to chunking and spillover mechanisms.
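A small sketch of those knobs, assuming the duckdb Python package and a hypothetical Parquet dataset: capping memory and pointing spillover at a temp directory lets an in-process query run over data larger than RAM:

```python
import duckdb

# DuckDB runs in-process; connect() with no path gives an in-memory database.
con = duckdb.connect()
con.execute("SET memory_limit = '4GB'")                  # cap working memory
con.execute("SET temp_directory = '/tmp/duckdb_spill'")  # enable disk spillover

# Query a Parquet dataset that may exceed the memory limit; DuckDB
# processes it in chunks and spills intermediate state to disk as needed.
result = con.execute("""
    SELECT user_id, count(*) AS n
    FROM read_parquet('data/events/*.parquet')
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""").fetchall()
print(result)
```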
Achal has observed an industry shift from platform-building to specialized end-user applications. With foundational AI models evolving rapidly, he argued that building platforms for AI applications is becoming less valuable, while leveraging existing foundation models and tools to solve specific end-user problems holds more potential.
Previously, companies focused on platform-based solutions for data engineering, observability, and MLOps, but now, with well-defined infrastructure options available, the industry is shifting toward solving concrete user challenges—such as fraud detection—using existing AI technologies.
The discussion also touched on the increasing importance of real-time machine learning, which has enabled companies in financial services, insurance, and other industries to make faster and more efficient decisions. The rise of companies specializing in optimizing AI inference, such as Fireworks and Baseten, reflects this trend.
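As a hedged illustration (not from the talk): providers in this space commonly expose OpenAI-compatible endpoints, so a real-time inference call reduces to a thin HTTP round trip. The base URL below follows Fireworks' published API shape, while the API key, model identifier, and prompt are placeholders to verify against their docs:

```python
from openai import OpenAI

# Point the OpenAI-compatible client at a hosted inference provider.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_API_KEY",  # placeholder credential
)

# One low-latency scoring call, e.g. as part of a fraud-review flow.
resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model id
    messages=[
        {"role": "user", "content": "Flag anything suspicious in this transaction: ..."}
    ],
)
print(resp.choices[0].message.content)
```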
Achal suggested that AI-driven businesses will increasingly focus on real-time, user-facing applications rather than infrastructure-building over the next five years.
The discussion also covered fine-tuning models for specific tasks, optimizing model deployment, and streaming data processing, grouped under the following themes:
- Fine-Tuning & Model Deployment
- Streaming & Real-Time Processing
- Model Performance & Testing
- Data Management & Infrastructure
- Emerging Trends & Considerations
The discussion emphasized the benefits of cloud-native architectures and disaggregated storage and compute models to reduce costs and improve scalability.
¹ Achal referred to this blog post during the presentation: https://motherduck.com/blog/big-data-is-dead/