Summary of "[JULY 2023] Community Call"
The July 2023 community call featured updates on the Apache Hudi project (pronounced "Hoodie") and a detailed presentation by data engineers from Job Target on how they leveraged Hudi to optimize their data infrastructure.
Community Updates:
- The Hudi community reported significant activity with 128 merged pull requests, 70 open pull requests, 43 closed issues, and 40 new issues.
- The Slack community grew to over 3,400 users, and GitHub stars surpassed 4,300.
- Forty-three contributors submitted pull requests in the last month.
- A major release (version 0.14.0, currently in beta) is expected soon, introducing record-level indexing and a primary-keyless data model.
- Two new RFCs (71 and 72) are in progress to improve conflict detection and redesign Hudi's Spark integration.
- Community members are encouraged to engage via Slack, GitHub, and blogs for technical deep dives and use case discussions.
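For readers who want to experiment, the record-level index slated for 0.14.0 is switched on through writer configuration. A minimal sketch, assuming the option keys follow Hudi's usual config naming (verify the exact keys and values against the 0.14.0 release notes):

```python
# Sketch: Hudi write options that would enable the 0.14.0 record-level index.
# Treat the exact keys/values as assumptions to check against the release notes.
record_index_options = {
    "hoodie.table.name": "events",
    "hoodie.metadata.enable": "true",              # record index lives in the metadata table
    "hoodie.metadata.record.index.enable": "true", # build the record-level index
    "hoodie.index.type": "RECORD_INDEX",           # route upserts through that index
}

# With a Spark session available, these options would be passed to the writer, e.g.:
# df.write.format("hudi").options(**record_index_options).mode("append").save(path)
```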
Presentation by Job Target Engineers (Samil and Divyansh):
- Context: Job Target is a talent acquisition solutions provider managing rapidly growing data volumes (doubling from ~20 TB to ~40 TB in a week, with over 82 million objects).
- Challenges: Managing data visualization, quality, security, analysis, and processing speed at scale.
- Solution: Adoption of Apache Hudi, a streaming data lake platform that brings database-like features (transactions, incremental processing, schema evolution) directly to data lakes.
- Benefits of Hudi:
- Eliminates duplicates
- Enhances data management
- Supports incremental data processing
- Provides resiliency with schema enforcement
- Accelerates queries via the multi-modal index
- Enables time travel and transactional guarantees
- Reduces costs by processing only new data and optimizing storage
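The duplicate elimination and transactional guarantees listed above come from Hudi's keyed upsert path. A minimal sketch of the writer options involved; the table and field names (`silver_events`, `event_id`, `updated_at`) are illustrative placeholders, not details from the talk:

```python
# Sketch: Hudi upsert configuration that deduplicates on a record key.
# Field names (event_id, updated_at) are placeholders for illustration.
upsert_options = {
    "hoodie.table.name": "silver_events",
    "hoodie.datasource.write.operation": "upsert",             # insert-or-update semantics
    "hoodie.datasource.write.recordkey.field": "event_id",     # key that defines a "duplicate"
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version wins on conflict
}

# df.write.format("hudi").options(**upsert_options).mode("append").save(path)
```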
- Implementation Details:
- Job Target uses a templated, serverless architecture on AWS with Glue ETL jobs.
- Each application has isolated AWS accounts with separate environments (QA, UAT, production).
- Raw data lands in S3 buckets and is ingested incrementally via Glue jobs into Hudi-managed Silver and Gold zones.
- Job scheduling is managed by AWS EventBridge, with job metadata stored in DynamoDB.
- Lambda functions trigger Glue jobs based on metadata, handle retries, and send failure alerts.
- Hudi tables are used for incremental ETL pipelines, joining event data with mapping and date dimension tables.
- Querying is done through Spark SQL and Athena, with visualization via AWS QuickSight.
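The Lambda-driven scheduling described above can be sketched as a pure-Python decision function. Everything here (the field names, status values, and retry limit) is a hypothetical illustration of the pattern, not code from the actual framework:

```python
# Hypothetical sketch of the Lambda decision logic: given one job-metadata
# record (as it might be read from DynamoDB), decide whether to start the
# Glue job, retry it, alert on repeated failure, or do nothing.
MAX_RETRIES = 3  # illustrative retry limit

def decide(job_meta: dict) -> str:
    """Return the action for one job record: 'start', 'retry', 'alert', or 'skip'."""
    status = job_meta.get("status", "PENDING")
    retries = job_meta.get("retry_count", 0)
    if status == "PENDING":
        return "start"   # the Lambda would call glue.start_job_run(...) here
    if status == "FAILED" and retries < MAX_RETRIES:
        return "retry"   # re-trigger the job with retry_count + 1
    if status == "FAILED":
        return "alert"   # retries exhausted: send a failure notification
    return "skip"        # RUNNING / SUCCEEDED need no action

print(decide({"status": "FAILED", "retry_count": 1}))  # retry
```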
- Technical Choices:
- For streaming (Silver zone), they use Hudi's Merge-on-Read (MOR) table type for fast writes.
- For the Gold zone, they use Hudi's Copy-on-Write (COW) table type for optimized query performance.
- Async clustering and automated file sizing improve query speed and reduce S3 list operation costs.
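The table-type and file-layout choices above map to a handful of writer options. A sketch assuming Hudi's standard config keys (worth double-checking against the version in use):

```python
# Sketch: table-type and layout-tuning options matching the choices above.
# Exact keys follow Hudi's config naming; verify against the deployed version.
silver_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # fast writes for streaming
}
gold_options = {
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # read-optimized serving layer
    "hoodie.clustering.async.enabled": "true",              # cluster in the background
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # fold small files into ~100 MB targets
}
```

Fewer, larger files is what reduces the S3 LIST overhead mentioned above: each list call returns more data per object enumerated.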
- Outcomes:
- Processing time reduced from hours to minutes.
- Cost savings of 4-5x due to incremental processing and storage optimization.
- Enhanced scalability and monitoring via a planned front-end UI for job setup and management.
- Development Effort:
- The framework was built in about 15 days by reusing code from a prior AWS Batch framework project that took 6-7 months.
- The team chose Hudi over Delta Lake because of their heavy reliance on AWS Glue and the need for tight integration.
- The entire framework is open source and available on GitHub for community use.
Q&A Highlights:
- Clarifications on the Hudi table types used (MOR for the Silver zone, COW for the Gold zone).
- Plans to expand the framework to support more data sources and add front-end monitoring.
- Discussion of cost considerations, including storage and compute, with Hudi helping reduce both.
- Addressing challenges with S3 list operation costs by using async indexing.
- Encouragement for community members to try the open-source framework and contribute.
Presenters and Contributors:
- Nadine (Host)
- Samil (Lead Data Engineer, Job Target)
- Divyansh Patel (Software Developer and Data Engineer, Job Target)
- Manish (Community participant, asked questions)
- Other unnamed community members participated in Q&A and comments.
This call highlighted Hudi's growing community, upcoming features, and a real-world case study demonstrating Hudi's impact on data lake efficiency, cost, and scalability using a serverless, templated approach on AWS.
Category
News and Commentary