Achal Shah (@achalshah), Presenter

Sandeep Joshi (@sand777), Curator

Summary of the meet-up discussion

Submitted Feb 4, 2025

Achal Shah, Tech Lead Manager at Tecton, discussed the evolving landscape of data processing.

  • Traditionally, distributed processing engines like Spark and Storm were essential for handling large datasets, but advancements in computing power have changed this. Many companies now prefer alternatives like Ray, Snowflake, Materialize, and DuckDB[1], questioning the necessity of distributed systems when high-performance single-node solutions can suffice.

  • Achal highlighted a shift in data architecture trends, where the lakehouse model—storing raw data and transforming it as needed—has become the standard. Open table formats and messaging systems are now widely adopted, even among companies migrating from legacy systems like HDFS. The once-debated distinction between data lakes and data warehouses is fading, as companies, including Snowflake, now support formats like Iceberg.

  • He also discussed the evolution of Spark programming, highlighting how Python has replaced Scala as the primary language for data engineering, making it easier to define data sources and sinks for batch and stream jobs (see the PySpark sketch after this list). He noted that Python has become the de facto API for data tools, often acting as a wrapper for high-performance code, and mentioned Rust-based tools like uv improving the Python ecosystem.

  • Achal discussed modern data infrastructure trends, emphasizing the shift from Hadoop-based HDFS deployments to object storage solutions like S3 with Delta, Iceberg, and Hudi table formats. He highlighted Unity Catalog as a dominant solution in the Databricks ecosystem, while open alternatives like OpenLineage exist but see limited traction.

  • Regarding databases, Achal described DuckDB as a high-performance, in-process SQL database that runs in memory by default but now supports disk-backed storage for datasets larger than memory. He touched on efficiency challenges related to chunking and spillover mechanisms (see the DuckDB sketch after this list).

  • Achal has observed a shift in the industry from platform-building to specialized end-user applications, emphasizing that leveraging existing foundation models and tools is now more valuable than developing slow-evolving platforms.

  • He discussed the rapid evolution of foundation models and how this impacts the industry, arguing that building platforms for AI applications is becoming less valuable, while solving specific end-user problems holds more potential.

  • Previously, companies focused on platform-based solutions for data engineering, observability, and MLOps, but now, with well-defined infrastructure options available, the industry is shifting toward solving concrete user challenges—such as fraud detection—using existing AI technologies.

  • The discussion also touched on the increasing importance of real-time machine learning, which has enabled companies in financial services, insurance, and other industries to make faster and more efficient decisions. The rise of companies specializing in optimizing AI inference, such as Fireworks and Baseten, reflects this trend.

  • Achal suggested that AI-driven businesses will increasingly focus on real-time, user-facing applications rather than infrastructure-building over the next five years.
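To make the Python-first Spark point above concrete, here is a minimal PySpark sketch of a batch job and a streaming job defined entirely in Python. The bucket paths, Kafka broker, and topic name are placeholders, and the Kafka and S3 connectors would need to be available in the Spark runtime.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline_sketch").getOrCreate()

# Batch: read a Parquet dataset and write an aggregated result back out
batch_df = spark.read.parquet("s3a://my-bucket/events/")            # hypothetical path
daily = batch_df.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("s3a://my-bucket/daily_counts/")

# Streaming: read from Kafka and write to a Parquet sink with checkpointing
stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")               # hypothetical broker
    .option("subscribe", "events")                                  # hypothetical topic
    .load()
)
query = (
    stream_df.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream.format("parquet")
    .option("path", "s3a://my-bucket/raw_events/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/raw_events/")
    .start()
)
query.awaitTermination()
```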
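Likewise, a minimal DuckDB sketch of the disk-backed, larger-than-memory mode mentioned above; the database filename, memory limit, spill directory, and Parquet path are illustrative assumptions.

```python
import duckdb

# Persistent, disk-backed database file (the default connection is purely in-memory)
con = duckdb.connect("analytics.duckdb")                  # hypothetical filename

# Cap working memory and give larger-than-memory operators a place to spill
con.execute("SET memory_limit = '4GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")   # hypothetical path

# Query a Parquet dataset directly; DuckDB scans it in chunks rather than
# loading the whole dataset into memory
top_users = con.execute("""
    SELECT user_id, count(*) AS events
    FROM read_parquet('events/*.parquet')
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()
print(top_users)
```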

The discussion also covered fine-tuning models for specific tasks, optimizing model deployment, and streaming data processing.

  1. Fine-Tuning & Model Deployment:

    • Companies can fine-tune large language models for specific use cases and deploy optimized, smaller models.
    • These models can be served through endpoints compatible with OpenAI’s API, so existing client code can call them without modification (see the sketch after this list).
    • Techniques like LoRA enable efficient adapter weight-swapping to support multiple fine-tuned models on a single endpoint.
  2. Streaming & Real-Time Processing:

    • Different streaming architectures impact latency.
    • Spark Streaming achieves ~1 second latency, but aggregated data processing can push this to 30-40 seconds.
    • Real-time updates can be done without full model retraining, focusing instead on updating feature values at inference time (see the sketch after this list).
    • Some systems allow partial retraining with recent data to enhance accuracy.
  3. Model Performance & Testing:

    • Companies like Netflix employ A/B testing and switchback experiments to evaluate model improvements while accounting for seasonality.
    • Shadow deployments (scoring without serving predictions) help validate new models before full rollout.
    • Companies often deploy thousands of models optimized for specific use cases and geographies.
  4. Data Management & Infrastructure:

    • Technologies like Iceberg are becoming the standard for managing large data lakes.
    • DuckDB is recommended for querying and processing structured data efficiently.
    • There is a trend toward separating compute and storage, where vendors provide compute services while customers retain data ownership.
    • Cross-account access models (e.g., Databricks) are gaining traction to maintain data control while enabling efficient processing.
  5. Emerging Trends & Considerations:

    • Egress costs and data access patterns are important factors in architectural decisions.
    • Apache DataFusion and similar projects aim to create pluggable database engines for more flexible computation.
    • Businesses are exploring ways to optimize data processing while balancing efficiency and cost.
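As a rough illustration of point 1, a fine-tuned model behind an OpenAI-compatible endpoint can be called with the standard OpenAI Python client by overriding base_url; the endpoint URL, API key, and model id below are placeholders, not any specific provider's values.

```python
from openai import OpenAI

# Standard OpenAI client pointed at an OpenAI-compatible serving endpoint
client = OpenAI(
    base_url="https://inference.example.com/v1",   # hypothetical endpoint
    api_key="YOUR_API_KEY",                        # placeholder credential
)

response = client.chat.completions.create(
    model="my-org/my-finetuned-model",             # hypothetical fine-tuned model id
    messages=[
        {"role": "user", "content": "Flag anything suspicious in this transaction."},
    ],
)
print(response.choices[0].message.content)
```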
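And as a rough illustration of point 2, refreshing feature values at inference time (rather than retraining the model) might look like the following sketch, which assumes a scikit-learn-style model and Redis as the online feature store; all names, keys, and features are hypothetical.

```python
import pickle
import redis  # assuming Redis as the online store; any low-latency key-value store works

# Model trained offline; it is not retrained here, only fed fresh feature values
with open("fraud_model.pkl", "rb") as f:
    model = pickle.load(f)

store = redis.Redis(host="localhost", port=6379, decode_responses=True)

def score_transaction(user_id: str, amount: float) -> float:
    # Streaming jobs keep these keys up to date; inference just reads the latest values
    txn_count_1h = float(store.get(f"user:{user_id}:txn_count_1h") or 0.0)
    avg_amount_7d = float(store.get(f"user:{user_id}:avg_amount_7d") or 0.0)
    return model.predict_proba([[amount, txn_count_1h, avg_amount_7d]])[0][1]
```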

The discussion emphasized the benefits of cloud-native architectures and disaggregated storage and compute models to reduce costs and improve scalability.

  • Avoiding Egress Costs: Using VPC endpoints instead of routing through the public internet reduces egress charges, since traffic stays on the AWS network.
  • Disaggregated Storage & Compute: Platforms like Snowflake follow this model, allowing independent scaling of compute and storage. Services like Backblaze B2 and Cloudflare R2 offer cheaper, S3-compatible storage alternatives (see the sketch after this list).
  • WarpStream & Kafka Optimization: WarpStream reimplements the Kafka protocol with stateless agents backed by object storage, reducing cross-zone traffic costs and allowing quick scaling for burst handling. Kafka’s traditional broker model incurs high synchronization costs across brokers.
  • Trends in Compute Engines: New compute engines, including those used in Spark, follow similar trends, disaggregating compute from storage for better efficiency.
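As a small illustration of the S3-compatible storage point above, the same boto3 client can be pointed at a non-AWS object store by overriding endpoint_url; the endpoint, credentials, and bucket below are placeholders (shaped like a Cloudflare R2 endpoint).

```python
import boto3

# Standard S3 client pointed at an S3-compatible store via a custom endpoint
s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # hypothetical endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",                           # placeholder credential
    aws_secret_access_key="YOUR_SECRET_KEY",                       # placeholder credential
)

# Upload a file and list objects exactly as you would against AWS S3
s3.upload_file("local_data.parquet", "my-bucket", "data/local_data.parquet")
objects = s3.list_objects_v2(Bucket="my-bucket", Prefix="data/")
```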

[1] Achal referred to this blog post during the presentation: https://motherduck.com/blog/big-data-is-dead/

