Summary of "[June 2022] Apache Hudi - Community Call"
Summary of [June 2022] Apache Hudi - Community Call
1. Apache Hudi Community Updates & Release 0.11.1
- The community is highly active with around 170 active pull requests and over 80 contributors pushing 90+ commits in the last month.
- Increased interest and contributions in Spark and Flink integrations.
- The minor release 0.11.1 focused on performance improvements and bug fixes from 0.11.0, resolving over 100 issues.
- Key fixes include:
- The release was benchmarked with TPC-DS to validate performance improvements.
- The community now has over 2000 Slack users and 3200 GitHub stars.
- Resources for newcomers include official docs, wikis, RFCs, GitHub repo, mailing lists, and Slack channel.
2. Incremental ETL with Apache Hudi at Uber (Speaker: Vinoth)
- Motivation: Traditional batch ETL is slow and inefficient, often recomputing entire datasets (e.g., weeks or months of data), causing latency and data quality issues, especially with late-arriving data.
- Incremental ETL Model:
- Processes only the delta (new or changed data) instead of full data scans.
- Uses Hudi primitives such as incremental pull and upsert, leveraging Hudi timelines and indexes.
- Employs merge-on-read to reduce write amplification.
- Achieves sub-hour latency and 100% data completeness, even for updates to old partitions (e.g., 6 months old).
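The delta-processing model above can be sketched in plain Python. This is an illustrative toy, not the Hudi API: the list of commits stands in for Hudi's timeline, and the key-based merge stands in for an index-backed upsert.

```python
# Illustrative sketch of incremental pull + upsert (plain Python, not the
# Hudi API): commits mimic entries on a Hudi timeline; upsert merges by
# record key the way an index-backed Hudi write would.

def incremental_pull(commits, last_checkpoint):
    """Fetch only the delta: records from commits after the checkpoint."""
    return [r for t, recs in commits if t > last_checkpoint for r in recs]

def upsert(table, delta):
    """Merge delta records into the table by key (update or insert)."""
    for rec in delta:
        table[rec["key"]] = {**table.get(rec["key"], {}), **rec}
    return table

# Driver-earnings example: a late-arriving tip updates an old record
# without recomputing the whole dataset.
commits = [
    (1, [{"key": "d1", "earnings": 100}]),
    (2, [{"key": "d2", "earnings": 80}, {"key": "d1", "tip": 5}]),
]
table = {}
upsert(table, incremental_pull(commits, 0))  # first run: everything so far
delta = incremental_pull(commits, 1)         # next run: only commit 2's delta
upsert(table, delta)
# table["d1"] now carries both the original earnings and the late tip
```

The checkpoint is the key idea: each run reads only commits newer than the last one it consumed, which is why old partitions can be updated without full recomputation.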
- Architecture:
- Joins in Incremental ETL:
- Careful handling of stream-static table joins (e.g., left outer joins).
- Incremental reads combined with selective partition fetching to avoid full scans.
- Backfills use snapshot reads with insert-overwrite to maintain data quality.
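The selective-partition-fetch idea behind these joins can be sketched as follows. The function and column names are illustrative, not a Hudi or Spark API: the point is that only dimension partitions referenced by the incremental delta are loaded, avoiding a full scan.

```python
# Sketch of a stream-static left outer join with partition pruning:
# only the dimension partitions the delta touches are fetched.

def join_delta_with_dim(delta, load_partition):
    """Left-outer join delta rows with dimension rows, loading only
    the dimension partitions the delta actually references."""
    needed = {row["city"] for row in delta}   # partition column of the dim table
    dim = {}
    for part in sorted(needed):               # selective fetch, not a full scan
        dim.update(load_partition(part))
    return [{**row, **dim.get(row["driver_id"], {})} for row in delta]

# Fake partitioned dimension store, keyed by city then driver_id.
store = {
    "sf":  {"d1": {"vehicle": "sedan"}},
    "nyc": {"d2": {"vehicle": "suv"}},
}
loaded = []
def load_partition(part):
    loaded.append(part)  # record which partitions were actually read
    return store[part]

delta = [{"driver_id": "d1", "city": "sf", "fare": 12}]
joined = join_delta_with_dim(delta, load_partition)
# Only the "sf" partition was read; the unmatched case would simply
# leave the dimension columns absent (left outer join semantics).
```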
- Use Case: Driver earnings aggregation with frequent updates, including late-arriving tip data.
- Performance Gains:
- Reduced pipeline run time from 4-5 hours to ~45 minutes.
- Latency reduced from 31 hours to a few hours.
- Compute and memory costs on dimensional tables reduced by up to 3x.
- User Feedback:
- Mixed experience initially due to complexity of incremental joins and handling non-incremental columns.
- Documentation and runbooks improved usability.
- Key Takeaway: Hudi’s metadata and timeline enable efficient incremental ETL that traditional batch processing cannot achieve.
3. Experience Running Apache Hudi on EMR EKS at Bazaar (Speaker: Marian)
- Background: Bazaar is a fast-growing e-commerce company focused on digitalizing retail in Pakistan, requiring real-time data with low cost.
- Challenges with EMR:
- EMR has concurrency limits: a hard cap of roughly 200 jobs, but in practice only ~25 run efficiently.
- Frequent job failures and long queue times due to resource contention.
- Auto-scaling on EMR is slow and unreliable, causing SLA violations.
- Limited instance type compatibility, especially memory-optimized instances.
- Monitoring on EMR is cumbersome, requiring manual log downloads.
- Motivation to Move to EMR on EKS:
- Needed better concurrency, cost efficiency, and real-time data delivery.
- EKS offers flexibility in instance types and better auto-scaling using Kubernetes features.
- Easier monitoring and observability with integrated Prometheus and Grafana dashboards.
- Architecture Overview:
- Data ingestion via CDC logs into data meshes and object storage.
- Batch and streaming processing via Apache Hudi on Spark running on EMR EKS.
- Data lifecycle managed through raw, bronze, and gold layers for governance and productization.
- Real-time data exposed via internal APIs and dashboards for business users and vendors.
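For concreteness, a Spark/Hudi job on EMR on EKS is submitted against a virtual cluster. A minimal sketch of the request dict a caller would pass to boto3's `emr-containers` `start_job_run` is below; the cluster ID, role ARN, S3 path, and release label are all placeholders, not values from the talk.

```python
# Sketch of an EMR-on-EKS job-run request (the shape accepted by boto3's
# emr-containers start_job_run). All identifiers below are placeholders.

def build_job_run_request(virtual_cluster_id, role_arn, entry_point):
    return {
        "name": "hudi-ingest",
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": role_arn,
        "releaseLabel": "emr-6.6.0-latest",
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                # Hudi requires Kryo serialization on Spark; modest executor
                # counts let the Kubernetes autoscaler pack pods efficiently.
                "sparkSubmitParameters": (
                    "--conf spark.executor.instances=2 "
                    "--conf spark.serializer="
                    "org.apache.spark.serializer.KryoSerializer"
                ),
            }
        },
    }

req = build_job_run_request(
    virtual_cluster_id="vc-123",                                  # placeholder
    role_arn="arn:aws:iam::111122223333:role/emr-eks-job-role",   # placeholder
    entry_point="s3://my-bucket/jobs/hudi_ingest.py",             # placeholder
)
```

Running jobs as Kubernetes pods this way is what allows the instance-type flexibility and autoscaling behavior described above, since the pods are scheduled on whatever nodes the cluster autoscaler provisions.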
- Benefits Observed:
- 61% cost reduction compared to EMR.
- Improved job stability and concurrency.
- Better resource utilization via Kubernetes autoscaling and node provisioners such as Karpenter.
- Enhanced monitoring and operational visibility.
- Remaining Challenges:
- Key Takeaway: EMR on EKS provides a scalable, cost-effective, and flexible platform for running Apache Hudi workloads.