Summary of "#219 Building a Data Platform that Drives Value | Shuang Li, Group Product Manager at Box"
Overview
Shuang Li, Group Product Manager at Box, shares insights from the multi-year journey of building Box’s data platform. The discussion highlights the challenges, key components, and strategic decisions involved in creating a scalable, reliable, and efficient data infrastructure that supports millions of users and billions of objects.
Key Technological Concepts & Product Features
1. What is a Data Platform?
- A high-scale data infrastructure designed to solve business problems by consolidating data from multiple sources, enabling analytics, insights, and product features.
- Before the platform, Box’s data was siloed across teams, leading to challenges in scalability, reliability, quality, and cost.
2. Core Components of a Data Platform
- Data Ingestion Pipeline: Captures data from diverse sources such as on-platform transactional data, metadata, user events, and third-party tools like SnapLogic. It must handle different formats, SLAs, and scales.
- Data Processing: Includes ETL pipelines and both batch and stream compute capabilities. Streaming is essential for near real-time use cases like anomaly detection in user activity for security.
- Data Storage & Management: Selecting appropriate tools and technologies to store and manage data efficiently, balancing performance, cost, and feature requirements.
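The streaming anomaly-detection use case mentioned above can be illustrated with a minimal sketch. It assumes a simple rolling z-score over per-minute activity counts; the class name, window size, threshold, and sample data are all hypothetical, and the summary does not say which method Box actually uses.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags values that deviate strongly from a rolling window (illustrative only)."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # most recent observations
        self.threshold = threshold          # z-score cutoff (assumed value)
        self.min_samples = min_samples      # warm-up before flagging anything

    def observe(self, value):
        """Return True if `value` is anomalous versus the window, then record it."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
            else:
                # A perfectly flat window: any change counts as anomalous.
                is_anomaly = value != mu
        self.values.append(value)
        return is_anomaly


# Hypothetical per-minute download counts for one user, arriving as a stream.
detector = RollingAnomalyDetector(window=10, threshold=3.0)
downloads_per_minute = [10, 12, 11, 9, 10, 11, 10, 250, 12]
flags = [detector.observe(v) for v in downloads_per_minute]
print(flags)  # only the spike of 250 is flagged
```

Each event is scored as it arrives rather than waiting for a batch job, which is why this kind of check fits a stream-compute layer. A production system would more likely run similar logic inside a stream processor with per-user state.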
3. Building the Platform During Cloud Migration
- The data platform was developed alongside Box’s company-wide migration to the cloud, adding complexity but enabling the use of cloud-native solutions.
- Collaboration with data engineering teams (within the go-to-market organization) and other stakeholders such as product, marketing, compliance, and customer success was critical for alignment and success.
4. Team Structure & Collaboration
- Product managers lead the data platform efforts, working closely with data platform engineers and a separate data engineering team.
- The data engineering team supports ETL pipelines and broader data needs, especially for marketing and customer success use cases.
5. Iterative Development & Milestones
- The platform development was likened to climbing a mountain with multiple peaks (milestones).
- The approach emphasized simplicity, iterative delivery, and prioritization between “lift and shift” (migrating existing workloads) and “uplift” (building optimized cloud-native architecture).
6. Technology Choices & Trade-offs
- Decisions between building in-house versus buying vendor solutions depend on feature sets, cost (licensing, engineering, maintenance), long-term scalability, vendor ecosystem, innovation speed, partnership quality, and strict security requirements.
- Early engagement with the security office is critical due to stringent compliance needs.
7. Data Quality & Observability
- Initially under-invested during migration, data quality and observability have become major focuses.
- The team restructured to separate foundational platform work from developer experience to better prioritize observability features such as data freshness, lineage, classification, and automated anomaly detection.
- Tools like Data Catalog are used for metadata management, data discovery, and lineage tracking.
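As a rough illustration of the data-freshness checks mentioned above, the sketch below compares each table's last load time against an allowed maximum age. The table names, thresholds, and the `stale_tables` helper are hypothetical and do not describe Box's tooling.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(last_loaded, max_age, now):
    """Return the sorted names of tables whose latest load exceeds their allowed age.

    `max_age` may contain a "*" entry used as the default threshold.
    """
    default = max_age.get("*", timedelta(hours=24))
    return sorted(
        name
        for name, loaded_at in last_loaded.items()
        if now - loaded_at > max_age.get(name, default)
    )


# Hypothetical load timestamps, as a metadata catalog might report them.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "user_events": now - timedelta(minutes=5),
    "billing_daily": now - timedelta(hours=30),
    "audit_logs": now - timedelta(hours=2),
}
max_age = {
    "user_events": timedelta(minutes=15),
    "audit_logs": timedelta(hours=1),
    "*": timedelta(hours=24),
}

stale = stale_tables(last_loaded, max_age, now)
print(stale)  # ['audit_logs', 'billing_daily']
```

Keeping the thresholds per table reflects the point that different sources carry different SLAs. The same pattern extends to late-arrival or volume checks by swapping the comparison.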
8. Developer Experience & Data Democratization
- A key metric is reducing “time to value” — how quickly internal teams can onboard, explore, experiment, and move to production using the platform.
- Efforts include improving data discovery (tagging, descriptions, metadata), providing playground environments for experimentation, documentation, tutorials, and shadow environments for production-like testing.
- Data democratization focuses on making data accessible to both technical and non-technical users (e.g., product analysts), reducing time spent searching for data from weeks to much shorter periods.
9. Challenges Encountered
- Learning curve and best practices around cloud migration.
- Managing a massive, company-wide project with many stakeholders and continuously aligning goals.
- Balancing engineering team morale between foundational “lift and shift” work and innovative “uplift” features.
10. Cost Management & Scaling
- Costs are forecast monthly and quarterly, accounting for both organic growth and new use cases.
- Strategic decisions on vendor usage to optimize costs (e.g., paying a vendor to reduce ingestion volume to another vendor).
- Emphasis on the “Rule of 40” for SaaS companies: revenue growth rate plus profit margin should total at least 40%. Data platform costs feed directly into the profit-margin side of that sum.
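To make the Rule of 40 concrete, the sketch below shows how a change in platform spend moves the score. All figures are invented for illustration, and both function names are hypothetical.

```python
def rule_of_40_score(revenue_growth_pct, profit_margin_pct):
    """Rule of 40 score: revenue growth rate plus profit margin, both in percent."""
    return revenue_growth_pct + profit_margin_pct


def margin_after_cost_change(revenue, profit, cost_delta):
    """Profit margin in percent after costs change by `cost_delta`.

    A positive `cost_delta` means more spend; a negative one means savings.
    """
    return (profit - cost_delta) * 100 / revenue


# Invented figures (in $M), for illustration only.
revenue = 1000.0
profit = 250.0
growth_pct = 12.0

baseline = rule_of_40_score(
    growth_pct, margin_after_cost_change(revenue, profit, 0.0)
)
after_savings = rule_of_40_score(
    growth_pct, margin_after_cost_change(revenue, profit, -10.0)
)
print(baseline, after_savings)
```

With these made-up numbers, $10M of savings on data platform spend lifts the margin by one point and the score by the same amount. This is the sense in which platform cost work contributes to a company-level metric.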
11. Metrics & Alignment (The “Ladder Up” Framework)
- Three levels of metrics:
  - Company-level (profitable growth)
  - Product & engineering-level (e.g., enabling new use cases like streaming)
  - Team-level (e.g., delivering streaming capability)
- This framework helps communicate impact and motivate engineers by linking their work to business outcomes.
12. Future Trends & Next Steps
- Continued investment in developer experience and tooling/frameworks to simplify data aggregation and business logic implementation for non-expert teams.
- Uplifting the log pipeline with tiered architecture supporting real-time, analytics, and compliance needs.
- Leveraging AI for data observability (automated anomaly detection, data loss, late arrival) and enhancing data discovery with natural language querying capabilities.
Guides, Tutorials, or Best Practices Highlighted
- Start with clear alignment across all stakeholders early and maintain it throughout the journey.
- Break down the massive project into smaller, iterative milestones with achievable goals.
- Prioritize simplicity in architecture and incremental adoption of cloud-native tools.
- Separate foundational platform work from developer experience to ensure focused investments in data quality and usability.
- Use a structured metric framework to link engineering work to business impact and maintain team morale.
- Invest in metadata management and data catalog tools to improve data discovery and democratization.
- Engage security teams early when selecting vendors to avoid delays.
Main Speaker / Source
- Shuang Li – Group Product Manager at Box, leading the data platform team.