Summary of Program Synthesis meets Machine Learning
Overview:
The talk, delivered by Sriram Rajamani (Distinguished Scientist and Managing Director at Microsoft Research India), explores the intersection of Program Synthesis, formal verification, and Machine Learning. It provides a tutorial introduction to Program Synthesis, a comparison with supervised Machine Learning, and presents ongoing research combining these fields to solve practical problems involving extraction from noisy, heterogeneous data.
Key Technological Concepts and Product Features
- Program Synthesis Fundamentals:
- Program Synthesis is the automatic generation of programs from specifications.
- Originated in Church’s 1957 formulation: given a relational specification \( p(x,y) \), construct a function \( f \) such that \( p(x, f(x)) \) holds for every input \( x \).
- Early synthesis algorithms had prohibitive worst-case complexity (non-elementary, or PSPACE/EXPTIME, depending on the specification logic).
- Modern practical synthesis restricts the search space to finite, constrained templates (e.g., "sketching" approach by Solar-Lezama).
- Tools like Sketch, FlashFill (shipped in Microsoft Excel), and PROSE make synthesis practical by filling holes in partially specified programs using examples or constraints.
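The template-based approach above can be sketched in a few lines: fix a program shape with a "hole", then search a small finite space for a hole value consistent with all input-output examples. This is an illustrative Python toy, not the Sketch tool's actual language.

```python
# Minimal sketch-style synthesis: complete the template f(x) = x & (x + ??)
# by searching a finite hole space until all examples are satisfied.

def synthesize(examples, hole_space):
    for h in hole_space:
        if all((x & (x + h)) == y for x, y in examples):
            return h
    return None  # no completion of the template fits the examples

# Examples describing "clear the rightmost set bit":
examples = [(0b1010, 0b1000), (0b0111, 0b0110), (0b1100, 0b1000)]
hole = synthesize(examples, range(-8, 8))
print(hole)  # -1, recovering the classic bit trick x & (x - 1)
```

The search is exhaustive but tiny; real tools like Sketch scale this idea with constraint solvers instead of brute-force enumeration.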
- Connection to Machine Learning:
- Supervised Machine Learning can be viewed as synthesizing a program (a model) from input-output examples.
- Machine Learning typically uses large datasets and optimizes a continuous loss function (e.g., via gradient descent).
- Program Synthesis uses smaller example sets and combinatorial search over a constrained program space.
- ML models are robust to noise but less interpretable; synthesized programs are interpretable but sensitive to noise.
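The contrast above can be made concrete with a toy target f(x) = 2x + 1: the ML view fits a parameterized program by continuous optimization, while the synthesis view searches a discrete program space. Both the learning rate and the integer search range are illustrative choices, not from the talk.

```python
# Supervised learning as program synthesis: two routes to f(x) = 2x + 1.
examples = [(0, 1), (1, 3), (2, 5), (3, 7)]

# ML view: gradient descent on a continuous squared loss over (w, b).
w, b = 0.0, 0.0
for _ in range(2000):
    dw = sum(2 * (w * x + b - y) * x for x, y in examples) / len(examples)
    db = sum(2 * (w * x + b - y) for x, y in examples) / len(examples)
    w, b = w - 0.05 * dw, b - 0.05 * db

# Synthesis view: combinatorial search over small integer coefficients.
found = next((a, c) for a in range(-5, 6) for c in range(-5, 6)
             if all(a * x + c == y for x, y in examples))

print(round(w), round(b), found)  # 2 1 (2, 1)
```

The optimizer tolerates a noisy example at the cost of exactness; the enumerator returns an exact, readable program but fails outright if any example is corrupted, which mirrors the robustness/interpretability trade-off above.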
- Heterogeneous Data Extraction Problem:
- Extracting structured fields from semi-structured, heterogeneous data sources (e.g., political candidate records, airline emails).
- Challenges:
- Data heterogeneity causes synthesis tools to fail if forced to produce a single program.
- Manual clustering and labeling for each format is expensive and brittle.
- Existing industry solutions cluster data and generate one extraction script per cluster but require constant maintenance.
- Combining Program Synthesis and Machine Learning:
- Proposed a hybrid system combining:
- Machine Learning models trained on sparse, noisy labeled data to generate initial noisy labels.
- Program Synthesis tools (a modified PROSE) that take noisy labels and constraints and synthesize a set of disjunctive programs (one per implicit cluster).
- Use domain-specific languages (DSLs) and field-specific constraints (e.g., type or format checks for names, dates).
- Active learning loop: disagreement between ML and synthesis triggers human annotation to improve the model.
- The system leverages voting and confidence scoring between ML and synthesis outputs to reduce noise and improve precision/recall.
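The agreement/disagreement loop described above can be sketched as follows. All function names here are illustrative stand-ins, not the system's actual APIs: an ML labeler and a synthesized program each extract a field, agreement yields a high-confidence label, and disagreement is escalated to a human annotator.

```python
# Hedged sketch of the hybrid labeling loop: accept labels where the ML
# model and the synthesized program agree; route disagreements to review.

def ml_extract(record):            # stand-in for a trained ML labeler
    return record.get("ml_guess")

def synthesized_extract(record):   # stand-in for a PROSE-style program
    return record.get("prog_guess")

def label_or_escalate(records):
    auto_labels, needs_review = [], []
    for r in records:
        ml, prog = ml_extract(r), synthesized_extract(r)
        if ml == prog:                      # both agree: accept the label
            auto_labels.append((r["id"], ml))
        else:                               # disagreement: ask a human
            needs_review.append(r["id"])
    return auto_labels, needs_review

records = [
    {"id": 1, "ml_guess": "2021-05-04", "prog_guess": "2021-05-04"},
    {"id": 2, "ml_guess": "May 4",      "prog_guess": "2021-05-04"},
]
print(label_or_escalate(records))  # ([(1, '2021-05-04')], [2])
```

The escalated records, once annotated, feed back into retraining the ML model, closing the active-learning loop.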
- Technical Details:
- Constraints can be soft (weighted) rather than hard, reflecting uncertainty or incompleteness in domain knowledge.
- The synthesis problem is formulated as a quantified Boolean formula (QBF) problem, solved by iterative SAT solving.
- Sampling from ML-generated labels followed by Program Synthesis yields many candidate programs; a minimal cover of programs is selected to maximize coverage.
- The disjunctive program implicitly defines clusters, removing the need for explicit clustering upfront.
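Selecting a minimal cover from many candidate programs is a set-cover problem; a standard greedy heuristic, sketched below with made-up program names, repeatedly picks the candidate that covers the most still-uncovered records.

```python
# Greedy set cover over candidate programs: each synthesized program
# handles a subset of records; keep the fewest programs that together
# cover all records. (Illustrative heuristic, not the system's exact one.)

def greedy_cover(coverage, universe):
    chosen, uncovered = [], set(universe)
    while uncovered:
        best = max(coverage, key=lambda p: len(coverage[p] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:          # remaining records covered by no program
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

coverage = {
    "prog_a": {1, 2, 3},
    "prog_b": {3, 4},
    "prog_c": {4, 5},
}
print(greedy_cover(coverage, {1, 2, 3, 4, 5}))  # ['prog_a', 'prog_c']
```

The chosen programs, joined disjunctively, then define the implicit clusters: each record belongs to the cluster of whichever program extracts it.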
- Experimental Results:
- Applied to a political-candidate dataset and airline travel emails.
- Iterative loop improves precision and recall significantly over iterations.
- Domain-specific constraints and DSL tuning critically improve performance.
- Early results show practical promise and interest from industry teams.
- Other Related Work and Opportunities:
- Using ML to prune and guide the search space in Program Synthesis (e.g., ranking grammar productions).
- Neural program induction: generating programs via neural networks, often less interpretable.
- Automatic differentiation frameworks in ML (TensorFlow, PyTorch) treat neural networks as composable programs with differentiable operators, linking PL and ML.
- Formal verification of neural networks for adversarial robustness is an emerging area but faces scalability and specification challenges.
Guides, Tutorials, and Reviews Provided
- Tutorial on Program Synthesis:
- Historical context and theoretical foundations.
- Practical synthesis using templates and sketches.
- Examples from bit manipulation problems and FlashFill.
- Programming Languages Perspective on Supervised Learning:
- Viewing ML models as parameterized programs.
- Comparison of ML optimization vs combinatorial search in synthesis.
- Case Study: Heterogeneous Data Extraction Framework:
- Problem motivation and real-world datasets.
- System architecture combining ML and synthesis.
- Metrics (precision, recall, F1) and evaluation methodology.
- Discussion on constraints and domain knowledge encoding.
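The evaluation metrics listed above follow from field-level counts of true positives (tp), false positives (fp), and false negatives (fn); the counts below are made-up numbers for illustration.

```python
# Precision, recall, and F1 from extraction counts.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# E.g., 80 correct extractions, 20 spurious, 20 missed:
p, r, f = precision_recall_f1(80, 20, 20)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8
```

In the hybrid system, precision tends to rise when synthesis programs veto spurious ML labels, while recall rises as human annotations from the active-learning loop cover more formats.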
- Discussion on Formal Verification and ML:
- Verification of neural networks.
- Challenges in specifying robustness properties.
- Potential role of synthesized programs as interpretable runtime monitors for correctness.
Main Speakers / Sources
- Sriram Rajamani (Distinguished Scientist and Managing Director, Microsoft Research India)
Notable Quotes
— 00:00 — « No notable quotes »
Category
Technology