Summary of "#219 Building a Data Platform that Drives Value | Shuang Li, Group Product Manager at Box"
Overview
Shuang Li, Group Product Manager at Box, shares insights from the multi-year journey of building Box’s data platform. The discussion highlights the challenges, key components, and strategic decisions involved in creating a scalable, reliable, and efficient data infrastructure that supports millions of users and billions of objects.
Key Technological Concepts & Product Features
1. What is a Data Platform?
- A high-scale data infrastructure designed to solve business problems by consolidating data from multiple sources, enabling analytics, insights, and product features.
- Before the platform, Box’s data was siloed across teams, leading to challenges in scalability, reliability, quality, and cost.
2. Core Components of a Data Platform
- Data Ingestion Pipeline: Captures data from diverse sources such as on-platform transactional data, metadata, user events, and third-party tools like SnapLogic. It must handle different formats, SLAs, and scales.
- Data Processing: Includes ETL pipelines and both batch and stream compute capabilities. Streaming is essential for near real-time use cases like anomaly detection in user activity for security.
- Data Storage & Management: Selecting appropriate tools and technologies to store and manage data efficiently, balancing performance, cost, and feature requirements.
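As an illustration of the near real-time anomaly detection use case mentioned under Data Processing, here is a minimal sketch of a rolling z-score detector over a stream of per-minute activity counts. The class name, window size, threshold, and sample values are all hypothetical and chosen for illustration; the episode does not describe how Box implements this.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags a value that deviates strongly from a rolling window of recent values."""

    def __init__(self, window=5, threshold=3.0):
        self.window = deque(maxlen=window)  # most recent observations
        self.threshold = threshold          # z-score above which a value is flagged

    def observe(self, value):
        """Return True if `value` is anomalous versus the current window, then record it."""
        is_anomaly = False
        if len(self.window) == self.window.maxlen:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.window.append(value)
        return is_anomaly


# Hypothetical stream: events per minute for one user account.
detector = RollingAnomalyDetector(window=5, threshold=3.0)
events_per_minute = [10, 12, 11, 9, 10, 11, 95, 10]
flags = [detector.observe(v) for v in events_per_minute]
print(flags)
```

In this sample only the spike to 95 is flagged. A production stream processor would typically keep one such window per user or per key, and would likely exclude flagged values from the baseline.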
3. Building the Platform During Cloud Migration
- The data platform was developed alongside Box’s company-wide migration to the cloud, adding complexity but enabling the use of cloud-native solutions.
- Collaboration with data engineering teams (within the go-to-market organization) and other stakeholders such as product, marketing, compliance, and customer success was critical for alignment and success.
4. Team Structure & Collaboration
- Product managers lead the data platform efforts, working closely with data platform engineers and a separate data engineering team.
- The data engineering team supports ETL pipelines and broader data needs, especially for marketing and customer success use cases.
5. Iterative Development & Milestones
- The platform development was likened to climbing a mountain with multiple peaks (milestones).
- The approach emphasized simplicity, iterative delivery, and prioritization between “lift and shift” (migrating existing workloads) and “uplift” (building optimized cloud-native architecture).
6. Technology Choices & Trade-offs
- Decisions between building in-house versus buying vendor solutions depend on feature sets, cost (licensing, engineering, maintenance), long-term scalability, vendor ecosystem, innovation speed, partnership quality, and strict security requirements.
- Early engagement with the security office is critical due to stringent compliance needs.
7. Data Quality & Observability
- Data quality and observability were under-invested during the migration and have since become major areas of focus.
- The team restructured to separate foundational platform work from developer experience to better prioritize observability features such as data freshness, lineage, classification, and automated anomaly detection.
- Tools like Data Catalog are used for metadata management, data discovery, and lineage tracking.
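A data freshness check, one of the observability features listed above, can be sketched as a comparison between each dataset's last update time and an allowed maximum age. The dataset names, timestamps, and limits below are made up for illustration and are not taken from the episode.

```python
from datetime import datetime, timedelta, timezone


def stale_datasets(last_updated, max_age, now):
    """Return the names of datasets whose last update is older than their allowed age."""
    return sorted(
        name for name, ts in last_updated.items()
        if now - ts > max_age[name]
    )


# Fixed "now" so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical datasets with their last successful update times.
last_updated = {
    "user_events": now - timedelta(minutes=5),
    "billing_daily": now - timedelta(hours=30),
}

# Hypothetical freshness limits per dataset.
max_age = {
    "user_events": timedelta(minutes=15),
    "billing_daily": timedelta(hours=24),
}

stale = stale_datasets(last_updated, max_age, now)
print(stale)
```

Here `billing_daily` is reported as stale because its last update is 30 hours old against a 24-hour limit. In practice the timestamps would come from pipeline metadata or a catalog rather than a hard-coded dictionary.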
8. Developer Experience & Data Democratization
- A key metric is “time to value”: how quickly internal teams can onboard, explore, experiment, and move to production using the platform. The goal is to reduce it.
- Efforts include improving data discovery (tagging, descriptions, metadata), providing playground environments for experimentation, documentation, tutorials, and shadow environments for production-like testing.
- Data democratization focuses on making data accessible to both technical and non-technical users (e.g., product analysts), reducing time spent searching for data from weeks to much shorter periods.
9. Challenges Encountered
- Learning curve and best practices around cloud migration.
- Managing a massive, company-wide project with many stakeholders and continuously aligning goals.
- Maintaining engineering team morale while balancing foundational “lift and shift” work against more innovative “uplift” features.
10. Cost Management & Scaling
- Costs are forecast monthly and quarterly, accounting for both organic growth and new use cases.
- Vendor usage is chosen strategically to optimize cost (e.g., paying one vendor to reduce the volume of data ingested by another).
- Emphasis on the “rule of 40” for SaaS companies, under which revenue growth rate plus profit margin should reach at least 40%. Data platform costs affect the profit margin side of that sum.
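The rule of 40 arithmetic can be made concrete with a short sketch. The rule adds a company's revenue growth rate and its profit margin (both in percent) and compares the sum to 40. All figures below are hypothetical and only show how a reduction in platform cost moves the score.

```python
def profit_margin_pct(revenue, costs):
    """Profit margin as a percentage of revenue."""
    return 100.0 * (revenue - costs) / revenue


def rule_of_40_score(revenue_growth_pct, margin_pct):
    """Rule of 40 score: growth rate plus profit margin, both in percent."""
    return revenue_growth_pct + margin_pct


# Hypothetical figures, in arbitrary currency units.
revenue = 1000.0
costs_before = 800.0           # total costs, including data platform spend
platform_savings = 50.0        # assumed reduction in data platform cost
costs_after = costs_before - platform_savings
growth_pct = 15.0              # assumed year-over-year revenue growth

score_before = rule_of_40_score(growth_pct, profit_margin_pct(revenue, costs_before))
score_after = rule_of_40_score(growth_pct, profit_margin_pct(revenue, costs_after))
print(score_before, score_after)
```

With these assumed numbers the margin rises from 20% to 25%, lifting the score from 35 to 40. This is why platform cost savings show up directly in a company-level metric.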
11. Metrics & Alignment (The “Ladder Up” Framework)
- Three levels of metrics:
  - Company-level (profitable growth)
  - Product & engineering-level (e.g., enabling new use cases like streaming)
  - Team-level (e.g., delivering streaming capability)
- This framework helps communicate impact and motivate engineers by linking their work to business outcomes.
12. Future Trends & Next Steps
- Continued investment in developer experience and tooling/frameworks to simplify data aggregation and business logic implementation for non-expert teams.
- Uplifting the log pipeline with tiered architecture supporting real-time, analytics, and compliance needs.
- Leveraging AI for data observability (automated detection of anomalies, data loss, and late-arriving data) and enhancing data discovery with natural language querying.
Best Practices Highlighted
- Start with clear alignment across all stakeholders early and maintain it throughout the journey.
- Break down the massive project into smaller, iterative milestones with achievable goals.
- Prioritize simplicity in architecture and incremental adoption of cloud-native tools.
- Separate foundational platform work from developer experience to ensure focused investments in data quality and usability.
- Use a structured metric framework to link engineering work to business impact and maintain team morale.
- Invest in metadata management and data catalog tools to improve data discovery and democratization.
- Engage security teams early when selecting vendors to avoid delays.
Main Speaker / Source
- Shuang Li – Group Product Manager at Box, leading the data platform team.