Summary of "[June 2022] Apache Hudi - Community Call"
Summary of [June 2022] Apache Hudi - Community Call
1. Apache Hudi Community Updates & Release 0.11.1
- The community is highly active with around 170 active pull requests and over 80 contributors pushing 90+ commits in the last month.
- Increased interest and contributions in Spark and Flink integrations.
- The minor release 0.11.1 focused on performance improvements and bug fixes from 0.11.0, resolving over 100 issues.
- Key fixes include:
- The release was benchmarked with TPC-DS to validate performance improvements.
- The community now has over 2000 Slack users and 3200 GitHub stars.
- Resources for newcomers include official docs, wikis, RFCs, GitHub repo, mailing lists, and Slack channel.
2. Incremental ETL with Apache Hudi at Uber (Speaker: Vinoth)
- Motivation: Traditional batch ETL is slow and inefficient, often recomputing entire datasets (e.g., weeks or months of data), causing latency and data quality issues, especially with late-arriving data.
- Incremental ETL Model:
- Processes only the delta (new or changed data) instead of full data scans.
- Uses Hudi primitives such as incremental pull and upsert, leveraging Hudi timelines and indexes.
- Employs merge-on-read to reduce write amplification.
- Achieves sub-hour latency and 100% data completeness, even for updates to old partitions (e.g., 6 months old).
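The delta-processing model above can be sketched in plain Python. This is an illustrative toy, not the Hudi API: the list of commits stands in for Hudi's timeline, and the key-based merge stands in for an index-backed upsert.

```python
# Illustrative sketch of incremental pull + upsert (plain Python, not the
# Hudi API): commits mimic entries on a Hudi timeline; upsert merges by
# record key the way an index-backed Hudi write would.

def incremental_pull(commits, last_checkpoint):
    """Fetch only the delta: records from commits after the checkpoint."""
    return [r for t, recs in commits if t > last_checkpoint for r in recs]

def upsert(table, delta):
    """Merge delta records into the table by key (update or insert)."""
    for rec in delta:
        table[rec["key"]] = {**table.get(rec["key"], {}), **rec}
    return table

# Driver-earnings example: a late-arriving tip updates an old record
# without recomputing the whole dataset.
commits = [
    (1, [{"key": "d1", "earnings": 100}]),
    (2, [{"key": "d2", "earnings": 80}, {"key": "d1", "tip": 5}]),
]
table = {}
upsert(table, incremental_pull(commits, 0))  # first run: everything so far
delta = incremental_pull(commits, 1)         # next run: only commit 2's delta
upsert(table, delta)
# table["d1"] now carries both the original earnings and the late tip
```

The checkpoint is the key idea: each run reads only commits newer than the last one it consumed, which is why old partitions can be updated without full recomputation.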
- Architecture:
- Joins in Incremental ETL:
- Careful handling of stream-static table joins (e.g., left outer joins).
- Incremental reads combined with selective partition fetching to avoid full scans.
- Backfills use snapshot reads with insert-overwrite to maintain data quality.
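The selective-partition-fetch idea behind these joins can be sketched as follows. The function and column names are illustrative, not a Hudi or Spark API: the point is that only dimension partitions referenced by the incremental delta are loaded, avoiding a full scan.

```python
# Sketch of a stream-static left outer join with partition pruning:
# only the dimension partitions the delta touches are fetched.

def join_delta_with_dim(delta, load_partition):
    """Left-outer join delta rows with dimension rows, loading only
    the dimension partitions the delta actually references."""
    needed = {row["city"] for row in delta}   # partition column of the dim table
    dim = {}
    for part in sorted(needed):               # selective fetch, not a full scan
        dim.update(load_partition(part))
    return [{**row, **dim.get(row["driver_id"], {})} for row in delta]

# Fake partitioned dimension store, keyed by city then driver_id.
store = {
    "sf":  {"d1": {"vehicle": "sedan"}},
    "nyc": {"d2": {"vehicle": "suv"}},
}
loaded = []
def load_partition(part):
    loaded.append(part)  # record which partitions were actually read
    return store[part]

delta = [{"driver_id": "d1", "city": "sf", "fare": 12}]
joined = join_delta_with_dim(delta, load_partition)
# Only the "sf" partition was read; the unmatched case would simply
# leave the dimension columns absent (left outer join semantics).
```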
- Use Case: Driver earnings aggregation with frequent updates, including late-arriving tip data.
- Performance Gains:
- Reduced pipeline run time from 4-5 hours to ~45 minutes.
- Latency reduced from 31 hours to a few hours.
- Compute and memory costs on dimensional tables reduced by up to 3x.
- User Feedback:
- Mixed experience initially due to complexity of incremental joins and handling non-incremental columns.
- Documentation and runbooks improved usability.
- Key Takeaway: Hudi’s metadata and timeline enable efficient incremental ETL that traditional batch processing cannot achieve.
3. Experience Running Apache Hudi on EMR EKS at Bazaar (Speaker: Marian)
- Background: Bazaar is a fast-growing e-commerce company focused on digitalizing retail in Pakistan, requiring real-time data with low cost.
- Challenges with EMR:
- EMR has concurrency limits: a hard cap of roughly 200 jobs, but in practice only ~25 run efficiently.
- Frequent job failures and long queue times due to resource contention.
- Auto-scaling on EMR is slow and unreliable, causing SLA violations.
- Limited instance type compatibility, especially memory-optimized instances.
- Monitoring on EMR is cumbersome, requiring manual log downloads.
- Motivation to Move to EMR on EKS:
- Needed better concurrency, cost efficiency, and real-time data delivery.
- EKS offers flexibility in instance types and better auto-scaling using Kubernetes features.
- Easier monitoring and observability with integrated Prometheus and Grafana dashboards.
- Architecture Overview:
- Data ingestion via CDC logs into data meshes and object storage.
- Batch and streaming processing via Apache Hudi on Spark running on EMR EKS.
- Data lifecycle managed through raw, bronze, and gold layers for governance and productization.
- Real-time data exposed via internal APIs and dashboards for business users and vendors.
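For concreteness, a Spark/Hudi job on EMR on EKS is submitted against a virtual cluster. A minimal sketch of the request dict a caller would pass to boto3's `emr-containers` `start_job_run` is below; the cluster ID, role ARN, S3 path, and release label are all placeholders, not values from the talk.

```python
# Sketch of an EMR-on-EKS job-run request (the shape accepted by boto3's
# emr-containers start_job_run). All identifiers below are placeholders.

def build_job_run_request(virtual_cluster_id, role_arn, entry_point):
    return {
        "name": "hudi-ingest",
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": role_arn,
        "releaseLabel": "emr-6.6.0-latest",
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                # Hudi requires Kryo serialization on Spark; modest executor
                # counts let the Kubernetes autoscaler pack pods efficiently.
                "sparkSubmitParameters": (
                    "--conf spark.executor.instances=2 "
                    "--conf spark.serializer="
                    "org.apache.spark.serializer.KryoSerializer"
                ),
            }
        },
    }

req = build_job_run_request(
    virtual_cluster_id="vc-123",                                  # placeholder
    role_arn="arn:aws:iam::111122223333:role/emr-eks-job-role",   # placeholder
    entry_point="s3://my-bucket/jobs/hudi_ingest.py",             # placeholder
)
```

Running jobs as Kubernetes pods this way is what allows the instance-type flexibility and autoscaling behavior described above, since the pods are scheduled on whatever nodes the cluster autoscaler provisions.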
- Benefits Observed:
- 61% cost reduction compared to EMR.
- Improved job stability and concurrency.
- Better resource utilization via Kubernetes autoscaling and node provisioners such as Karpenter.
- Enhanced monitoring and operational visibility.
- Remaining Challenges:
- Key Takeaway: EMR on EKS provides a scalable, cost-effective, and flexible platform for running Apache Hudi workloads.