Summary of Program Synthesis meets Machine Learning
Overview:
The talk, delivered by Sriram Rajamani (Distinguished Scientist and Managing Director at Microsoft Research India), explores the intersection of Program Synthesis, formal verification, and Machine Learning. It provides a tutorial introduction to Program Synthesis, a comparison with supervised Machine Learning, and presents ongoing research combining these fields to solve practical problems involving extraction from noisy, heterogeneous data.
Key Technological Concepts and Product Features
- Program Synthesis Fundamentals:
- Program Synthesis is the automatic generation of programs from specifications.
- Originated in Church’s 1957 formulation: given a relational specification \( p(x,y) \), construct a function \( f \) such that \( p(x, f(x)) \) holds for every input \( x \).
- Early synthesis algorithms had prohibitive worst-case complexity (non-elementary, or PSPACE/EXPTIME, depending on the specification logic).
- Modern practical synthesis restricts the search space to finite, constrained templates (e.g., "sketching" approach by Solar-Lezama).
- Tools like Sketch, FlashFill (shipped in Microsoft Excel), and PROSE make synthesis practical by filling holes in partially specified programs using examples or constraints.
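The template-based approach above can be sketched in a few lines: fix a program shape with a "hole", then search a small finite space for a hole value consistent with all input-output examples. This is an illustrative Python toy, not the Sketch tool's actual language.

```python
# Minimal sketch-style synthesis: complete the template f(x) = x & (x + ??)
# by searching a finite hole space until all examples are satisfied.

def synthesize(examples, hole_space):
    for h in hole_space:
        if all((x & (x + h)) == y for x, y in examples):
            return h
    return None  # no completion of the template fits the examples

# Examples describing "clear the rightmost set bit":
examples = [(0b1010, 0b1000), (0b0111, 0b0110), (0b1100, 0b1000)]
hole = synthesize(examples, range(-8, 8))
print(hole)  # -1, recovering the classic bit trick x & (x - 1)
```

The search is exhaustive but tiny; real tools like Sketch scale this idea with constraint solvers instead of brute-force enumeration.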
- Connection to Machine Learning:
- Supervised Machine Learning can be viewed as synthesizing a program (a model) from input-output examples.
- Machine Learning typically uses large datasets and optimizes a continuous loss function (e.g., via gradient descent).
- Program Synthesis uses smaller example sets and combinatorial search over a constrained program space.
- ML models are robust to noise but less interpretable; synthesized programs are interpretable but sensitive to noise.
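The contrast above can be made concrete with a toy target f(x) = 2x + 1: the ML view fits a parameterized program by continuous optimization, while the synthesis view searches a discrete program space. Both the learning rate and the integer search range are illustrative choices, not from the talk.

```python
# Supervised learning as program synthesis: two routes to f(x) = 2x + 1.
examples = [(0, 1), (1, 3), (2, 5), (3, 7)]

# ML view: gradient descent on a continuous squared loss over (w, b).
w, b = 0.0, 0.0
for _ in range(2000):
    dw = sum(2 * (w * x + b - y) * x for x, y in examples) / len(examples)
    db = sum(2 * (w * x + b - y) for x, y in examples) / len(examples)
    w, b = w - 0.05 * dw, b - 0.05 * db

# Synthesis view: combinatorial search over small integer coefficients.
found = next((a, c) for a in range(-5, 6) for c in range(-5, 6)
             if all(a * x + c == y for x, y in examples))

print(round(w), round(b), found)  # 2 1 (2, 1)
```

The optimizer tolerates a noisy example at the cost of exactness; the enumerator returns an exact, readable program but fails outright if any example is corrupted, which mirrors the robustness/interpretability trade-off above.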
- Heterogeneous Data Extraction Problem:
- Extracting structured fields from semi-structured, heterogeneous data sources (e.g., political candidate records, airline emails).
- Challenges:
- Data heterogeneity causes synthesis tools to fail if forced to produce a single program.
- Manual clustering and labeling for each format is expensive and brittle.
- Existing industry solutions cluster data and generate one extraction script per cluster but require constant maintenance.
- Combining Program Synthesis and Machine Learning:
- Proposed a hybrid system combining:
- Machine Learning models trained on sparse, noisy labeled data to generate initial noisy labels.
- Program Synthesis tools (a modified PROSE) that take noisy labels and constraints and synthesize a set of disjunctive programs (one per implicit cluster).
- Use domain-specific languages (DSLs) and field-specific constraints (e.g., type or format checks for names, dates).
- Active learning loop: disagreement between ML and synthesis triggers human annotation to improve the model.
- The system leverages voting and confidence scoring between ML and synthesis outputs to reduce noise and improve precision/recall.
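The agreement/disagreement loop described above can be sketched as follows. All function names here are illustrative stand-ins, not the system's actual APIs: an ML labeler and a synthesized program each extract a field, agreement yields a high-confidence label, and disagreement is escalated to a human annotator.

```python
# Hedged sketch of the hybrid labeling loop: accept labels where the ML
# model and the synthesized program agree; route disagreements to review.

def ml_extract(record):            # stand-in for a trained ML labeler
    return record.get("ml_guess")

def synthesized_extract(record):   # stand-in for a PROSE-style program
    return record.get("prog_guess")

def label_or_escalate(records):
    auto_labels, needs_review = [], []
    for r in records:
        ml, prog = ml_extract(r), synthesized_extract(r)
        if ml == prog:                      # both agree: accept the label
            auto_labels.append((r["id"], ml))
        else:                               # disagreement: ask a human
            needs_review.append(r["id"])
    return auto_labels, needs_review

records = [
    {"id": 1, "ml_guess": "2021-05-04", "prog_guess": "2021-05-04"},
    {"id": 2, "ml_guess": "May 4",      "prog_guess": "2021-05-04"},
]
print(label_or_escalate(records))  # ([(1, '2021-05-04')], [2])
```

The escalated records, once annotated, feed back into retraining the ML model, closing the active-learning loop.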
- Technical Details:
- Constraints can be soft (weighted) rather than hard, reflecting uncertainty or incompleteness in domain knowledge.
- The synthesis problem is formulated as a quantified Boolean formula (QBF) problem, solved by iterative SAT solving.
- Sampling from ML-generated labels followed by Program Synthesis yields many candidate programs; a minimal cover of programs is selected to maximize coverage.
- The disjunctive program implicitly defines clusters, removing the need for explicit clustering upfront.
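Selecting a minimal cover from many candidate programs is a set-cover problem; a standard greedy heuristic, sketched below with made-up program names, repeatedly picks the candidate that covers the most still-uncovered records.

```python
# Greedy set cover over candidate programs: each synthesized program
# handles a subset of records; keep the fewest programs that together
# cover all records. (Illustrative heuristic, not the system's exact one.)

def greedy_cover(coverage, universe):
    chosen, uncovered = [], set(universe)
    while uncovered:
        best = max(coverage, key=lambda p: len(coverage[p] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:          # remaining records covered by no program
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

coverage = {
    "prog_a": {1, 2, 3},
    "prog_b": {3, 4},
    "prog_c": {4, 5},
}
print(greedy_cover(coverage, {1, 2, 3, 4, 5}))  # ['prog_a', 'prog_c']
```

The chosen programs, joined disjunctively, then define the implicit clusters: each record belongs to the cluster of whichever program extracts it.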
- Experimental Results:
- Applied to a political-candidate dataset and airline travel emails.
- Iterative loop improves precision and recall significantly over iterations.
- Domain-specific constraints and DSL tuning critically improve performance.
- Early results show practical promise and interest from industry teams.
- Other Related Work and Opportunities:
- Using ML to prune and guide the search space in Program Synthesis (e.g., ranking grammar productions).
- Neural program induction: generating programs via neural networks, often less interpretable.
- Automatic differentiation frameworks in ML (TensorFlow, PyTorch) treat neural networks as composable programs with differentiable operators, linking PL and ML.
- Formal verification of neural networks for adversarial robustness is an emerging area but faces scalability and specification challenges.
Guides, Tutorials, and Reviews Provided
- Tutorial on Program Synthesis:
- Historical context and theoretical foundations.
- Practical synthesis using templates and sketches.
- Examples from bit manipulation problems and FlashFill.
- Programming Languages Perspective on Supervised Learning:
- Viewing ML models as parameterized programs.
- Comparison of ML optimization vs combinatorial search in synthesis.
- Case Study: Heterogeneous Data Extraction Framework:
- Problem motivation and real-world datasets.
- System architecture combining ML and synthesis.
- Metrics (precision, recall, F1) and evaluation methodology.
- Discussion on constraints and domain knowledge encoding.
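The evaluation metrics listed above follow from field-level counts of true positives (tp), false positives (fp), and false negatives (fn); the counts below are made-up numbers for illustration.

```python
# Precision, recall, and F1 from extraction counts.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# E.g., 80 correct extractions, 20 spurious, 20 missed:
p, r, f = precision_recall_f1(80, 20, 20)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8
```

In the hybrid system, precision tends to rise when synthesis programs veto spurious ML labels, while recall rises as human annotations from the active-learning loop cover more formats.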
- Discussion on Formal Verification and ML:
- Verification of neural networks.
- Challenges in specifying robustness properties.
- Potential role of synthesized programs as interpretable runtime monitors for correctness.
Main Speakers / Sources
- Sriram Rajamani (Distinguished Scientist and Managing Director, Microsoft Research India)
Notable Quotes
— 00:00 — « No notable quotes »
Category
Technology