Summary of "[JULY 2023] Community Call"
The July 2023 community call featured updates on the Apache Hudi project (pronounced "Hoodie") and a detailed presentation by data engineers from Job Target on how they leveraged Hudi to optimize their data infrastructure.
Community Updates:
- The Hudi community reported significant activity with 128 merged pull requests, 70 open pull requests, 43 closed issues, and 40 new issues.
- The Slack community grew to over 3,400 users, and GitHub stars surpassed 4,300.
- Forty-three contributors submitted pull requests in the last month.
- A major release (version 0.14.0, currently in beta) is expected soon, introducing record-level indexing and a primary-keyless data model.
- Two new RFCs (71 and 72) are in progress to improve conflict detection and redesign Hudi's Spark integration.
- Community members are encouraged to engage via Slack, GitHub, and blogs for technical deep dives and use case discussions.
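For readers who want to experiment, the record-level index slated for 0.14.0 is switched on through writer configuration. A minimal sketch, assuming the option keys follow Hudi's usual config naming (verify the exact keys and values against the 0.14.0 release notes):

```python
# Sketch: Hudi write options that would enable the 0.14.0 record-level index.
# Treat the exact keys/values as assumptions to check against the release notes.
record_index_options = {
    "hoodie.table.name": "events",
    "hoodie.metadata.enable": "true",              # record index lives in the metadata table
    "hoodie.metadata.record.index.enable": "true", # build the record-level index
    "hoodie.index.type": "RECORD_INDEX",           # route upserts through that index
}

# With a Spark session available, these options would be passed to the writer, e.g.:
# df.write.format("hudi").options(**record_index_options).mode("append").save(path)
```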
Presentation by Job Target Engineers (Samil and Divyansh):
- Context: Job Target is a talent acquisition solutions provider managing rapidly growing data volumes (doubling from ~20 TB to ~40 TB in a week, with over 82 million objects).
- Challenges: Managing data visualization, quality, security, analysis, and processing speed at scale.
- Solution: Adoption of Apache Hudi, a streaming data lake platform that brings database-like features (transactions, incremental processing, schema evolution) directly to data lakes.
- Benefits of Hudi:
- Eliminates duplicates
- Enhances data management
- Supports incremental data processing
- Provides resiliency with schema enforcement
- Accelerates queries via the multi-modal index
- Enables time travel and transactional guarantees
- Reduces costs by processing only new data and optimizing storage
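The duplicate elimination and transactional guarantees listed above come from Hudi's keyed upsert path. A minimal sketch of the writer options involved; the table and field names (`silver_events`, `event_id`, `updated_at`) are illustrative placeholders, not details from the talk:

```python
# Sketch: Hudi upsert configuration that deduplicates on a record key.
# Field names (event_id, updated_at) are placeholders for illustration.
upsert_options = {
    "hoodie.table.name": "silver_events",
    "hoodie.datasource.write.operation": "upsert",             # insert-or-update semantics
    "hoodie.datasource.write.recordkey.field": "event_id",     # key that defines a "duplicate"
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version wins on conflict
}

# df.write.format("hudi").options(**upsert_options).mode("append").save(path)
```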
- Implementation Details:
- Job Target uses a templated, serverless architecture on AWS with Glue ETL jobs.
- Each application has isolated AWS accounts with separate environments (QA, UAT, production).
- Raw data lands in S3 buckets and is ingested incrementally via Glue jobs into Hudi-managed Silver and Gold zones.
- Job scheduling is managed by AWS EventBridge, with job metadata stored in DynamoDB.
- Lambda functions trigger Glue jobs based on metadata, handle retries, and send failure alerts.
- Hudi tables are used for incremental ETL pipelines, joining event data with mapping and date dimension tables.
- Querying is done through Spark SQL and Athena, with visualization via AWS QuickSight.
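The Lambda-driven scheduling described above can be sketched as a pure-Python decision function. Everything here (the field names, status values, and retry limit) is a hypothetical illustration of the pattern, not code from the actual framework:

```python
# Hypothetical sketch of the Lambda decision logic: given one job-metadata
# record (as it might be read from DynamoDB), decide whether to start the
# Glue job, retry it, alert on repeated failure, or do nothing.
MAX_RETRIES = 3  # illustrative retry limit

def decide(job_meta: dict) -> str:
    """Return the action for one job record: 'start', 'retry', 'alert', or 'skip'."""
    status = job_meta.get("status", "PENDING")
    retries = job_meta.get("retry_count", 0)
    if status == "PENDING":
        return "start"   # the Lambda would call glue.start_job_run(...) here
    if status == "FAILED" and retries < MAX_RETRIES:
        return "retry"   # re-trigger the job with retry_count + 1
    if status == "FAILED":
        return "alert"   # retries exhausted: send a failure notification
    return "skip"        # RUNNING / SUCCEEDED need no action

print(decide({"status": "FAILED", "retry_count": 1}))  # retry
```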
- Technical Choices:
- For streaming (Silver zone), they use Hudi's Merge-on-Read (MOR) table type for fast writes.
- For the Gold zone, they use Hudi's Copy-on-Write (COW) table type for optimized query performance.
- Async clustering and automated file sizing improve query speed and reduce S3 list operation costs.
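The table-type and file-layout choices above map to a handful of writer options. A sketch assuming Hudi's standard config keys (worth double-checking against the version in use):

```python
# Sketch: table-type and layout-tuning options matching the choices above.
# Exact keys follow Hudi's config naming; verify against the deployed version.
silver_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # fast writes for streaming
}
gold_options = {
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # read-optimized serving layer
    "hoodie.clustering.async.enabled": "true",              # cluster in the background
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # fold small files into ~100 MB targets
}
```

Fewer, larger files is what reduces the S3 LIST overhead mentioned above: each list call returns more data per object enumerated.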
- Outcomes:
- Processing time reduced from hours to minutes.
- Cost savings of 4-5x due to incremental processing and storage optimization.
- Enhanced scalability and monitoring via a planned front-end UI for job setup and management.
- Development Effort:
- The framework was built in about 15 days by reusing code from a prior AWS Batch framework project that took 6-7 months.
- The team chose Hudi over Delta Lake because of their heavy reliance on AWS Glue and the need for tight integration.
- The entire framework is open source and available on GitHub for community use.
Q&A Highlights:
- Clarifications on the Hudi table types used (MOR for the Silver zone, COW for the Gold zone).
- Plans to expand the framework to support more data sources and add front-end monitoring.
- Discussion of cost considerations, including storage and compute, with Hudi helping reduce both.
- Addressing challenges with S3 list operation costs by using async indexing.
- Encouragement for community members to try the open-source framework and contribute.
Presenters and Contributors:
- Nadine (Host)
- Samil (Lead Data Engineer, Job Target)
- Divyansh Patel (Software Developer and Data Engineer, Job Target)
- Manish (Community participant, asked questions)
- Other unnamed community members participated in Q&A and comments.
This call highlighted Hudi's growing community, upcoming features, and a real-world case study demonstrating Hudi's impact on data lake efficiency, cost, and scalability using a serverless, templated approach on AWS.
Category
News and Commentary