Summary of "Program Synthesis meets Machine Learning"

Overview:
The talk, delivered by Sriram Rajamani (Distinguished Scientist and Managing Director at Microsoft Research India), explores the intersection of Program Synthesis, formal verification, and Machine Learning. It gives a tutorial introduction to Program Synthesis, compares it with supervised Machine Learning, and presents ongoing research that combines the two fields to extract structured data from noisy, heterogeneous sources.


Key Technological Concepts and Product Features

  1. Program Synthesis Fundamentals:
    • Program Synthesis is the automatic generation of programs from specifications.
    • Originated with Church’s 1957 formulation, where the problem is to construct a function \( f \) such that \( p(x, f(x)) \) holds for every input \( x \), given a relational specification \( p(x,y) \).
    • Early synthesis was theoretically complex (non-elementary or PSPACE/EXPTIME depending on logic).
    • Modern practical synthesis restricts the search space to finite, constrained templates (e.g., "sketching" approach by Solar-Lezama).
    • Tools like Sketch, FlashFill (which ships in Microsoft Excel), and PROSE enable practical Program Synthesis by filling holes in partially specified programs using examples or constraints.
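The sketching idea above can be illustrated with a toy example (a simplified sketch, not the actual Sketch or PROSE tools): the program's overall shape is fixed as a template, and a hole `??` is filled by searching a small finite candidate space against input-output examples.

```python
# Toy template-based ("sketch"-style) synthesis: the template is
# f(x) = x * ??, and we search candidate constants for the hole
# until all input-output examples are satisfied.

def synthesize_hole(examples, candidates):
    """Return a constant c such that f(x) = x * c matches all examples."""
    for c in candidates:
        if all(x * c == y for x, y in examples):
            return c
    return None  # no program in the restricted space fits

examples = [(1, 3), (2, 6), (5, 15)]   # input-output pairs
hole = synthesize_hole(examples, range(10))
print(hole)  # 3
```

Restricting the search to a finite template like this is exactly what makes modern synthesis tractable compared with searching an unbounded program space.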
  2. Connection to Machine Learning:
    • Supervised Machine Learning can be viewed as synthesizing a program (a model) from input-output examples.
    • Machine Learning typically uses large datasets and optimizes a continuous loss function (e.g., via gradient descent).
    • Program Synthesis uses smaller example sets and combinatorial search over a constrained program space.
    • ML models are robust to noise but less interpretable; synthesized programs are interpretable but sensitive to noise.
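The contrast between the two paradigms can be made concrete on the same task, learning y = 2x + 1 from examples (a toy sketch for illustration only): supervised ML minimizes a continuous loss by gradient descent, while synthesis performs combinatorial search over a discrete program space.

```python
# Same task, two paradigms: fit y = 2x + 1 from examples.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]

# ML view: gradient descent on the squared loss of a linear model a*x + b.
a, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    ga = sum(2 * (a * x + b - y) * x for x, y in data)  # d(loss)/da
    gb = sum(2 * (a * x + b - y) for x, y in data)      # d(loss)/db
    a, b = a - lr * ga, b - lr * gb

# Synthesis view: enumerate a small discrete space of programs (p, q)
# and keep the first one consistent with every example.
found = next((p, q) for p in range(5) for q in range(5)
             if all(p * x + q == y for x, y in data))

print(round(a), round(b), found)  # 2 1 (2, 1)
```

Both recover the same function here, but the synthesized program is exact and interpretable, while the gradient-descent fit only converges toward it numerically.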
  3. Heterogeneous Data Extraction Problem:
    • Extracting structured fields from semi-structured, heterogeneous data sources (e.g., political candidate records, airline emails).
    • Challenges:
      • Data heterogeneity causes synthesis tools to fail if forced to produce a single program.
      • Manual clustering and labeling for each format is expensive and brittle.
    • Existing industry solutions cluster data and generate one extraction script per cluster but require constant maintenance.
  4. Combining Program Synthesis and Machine Learning:
    • Proposed a hybrid system combining:
      • Machine Learning models trained on sparse, noisy labeled data to generate initial noisy labels.
      • Program Synthesis tools (a modified PROSE) that take noisy labels and constraints to synthesize a set of disjunctive programs (one per implicit cluster).
    • Use domain-specific languages (DSLs) and field-specific constraints (e.g., type or format checks for names, dates).
    • Active learning loop: disagreement between ML and synthesis triggers human annotation to improve the model.
    • The system leverages voting and confidence scoring between ML and synthesis outputs to reduce noise and improve precision/recall.
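The hybrid loop described above can be sketched in a few lines, with hypothetical helper names (`ml_predict`, `synthesize`, `ask_human`) standing in for the real components: the ML model proposes noisy labels, synthesis generalizes them into a program, and records where the two disagree are queued for human annotation.

```python
# Minimal sketch of the hybrid ML + synthesis loop (helper names are
# hypothetical placeholders, not the system's actual API).

def hybrid_extract(records, ml_predict, synthesize, ask_human):
    ml_labels = {r: ml_predict(r) for r in records}   # noisy ML labels
    program = synthesize(ml_labels)                   # generalize the labels
    to_annotate = [r for r in records
                   if program(r) != ml_labels[r]]     # disagreement signal
    gold = {r: ask_human(r) for r in to_annotate}     # active learning step
    return program, gold  # gold labels feed the next training round
```

Disagreement acts as a cheap confidence signal: agreement between the two independent predictors is treated as likely correct, and only the conflicts cost human effort.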
  5. Technical Details:
    • Constraints can be soft (weighted) rather than hard, reflecting uncertainty or incompleteness in domain knowledge.
    • The synthesis problem is formulated as a quantified Boolean formula (QBF) problem, solved by iterative SAT solving.
    • Sampling from ML-generated labels followed by Program Synthesis yields many candidate programs; a minimal cover of programs is selected to maximize coverage.
    • The disjunctive program implicitly defines clusters, removing the need for explicit clustering upfront.
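Selecting a minimal cover of candidate programs is a set-cover problem; a standard greedy heuristic (an assumption for illustration; the talk's exact selection algorithm may differ) repeatedly picks the candidate that handles the most still-uncovered records.

```python
# Greedy set-cover sketch: choose a small disjunctive set of programs
# that together handle all records. Each chosen program implicitly
# defines one cluster of the data.

def greedy_program_cover(records, programs):
    """programs: dict mapping program name -> set of records it handles."""
    uncovered, chosen = set(records), []
    while uncovered:
        best = max(programs, key=lambda p: len(programs[p] & uncovered))
        if not programs[best] & uncovered:
            break  # remaining records are handled by no candidate
        chosen.append(best)
        uncovered -= programs[best]
    return chosen

progs = {"p1": {1, 2, 3}, "p2": {3, 4}, "p3": {4, 5}}
print(greedy_program_cover([1, 2, 3, 4, 5], progs))  # ['p1', 'p3']
```

Note that the clusters fall out of the chosen programs' coverage sets, so no explicit clustering step is needed beforehand.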
  6. Experimental Results:
    • Applied to political dataset and airline travel data.
    • Iterative loop improves precision and recall significantly over iterations.
    • Domain-specific constraints and DSL tuning critically improve performance.
    • Early results show practical promise and interest from industry teams.
  7. Other Related Work and Opportunities:
    • Using ML to prune and guide the search space in Program Synthesis (e.g., ranking grammar productions).
    • Neural program induction: generating programs via neural networks, often less interpretable.
    • Automatic differentiation frameworks in ML (TensorFlow, PyTorch) treat neural networks as composable programs with differentiable operators, linking PL and ML.
    • Formal verification of neural networks for adversarial robustness is an emerging area but faces scalability and specification challenges.
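The "neural networks as composable differentiable programs" view can be illustrated with a tiny forward-mode autodiff built from dual numbers (a pure-Python sketch; TensorFlow and PyTorch use reverse mode, but the composable-operator idea is the same).

```python
# Forward-mode automatic differentiation via dual numbers: each value
# carries (val, grad), and composing operators composes derivatives
# by the chain rule -- the PL view of a differentiable program.

class Dual:
    def __init__(self, val, grad=0.0):
        self.val, self.grad = val, grad
    def __add__(self, other):
        return Dual(self.val + other.val, self.grad + other.grad)
    def __mul__(self, other):  # product rule
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

def f(x):                # f(x) = x*x + x, so f'(x) = 2x + 1
    return x * x + x

x = Dual(3.0, 1.0)       # seed the derivative dx/dx = 1
y = f(x)
print(y.val, y.grad)     # 12.0 7.0
```

Because every operator carries its own derivative rule, arbitrary compositions of such operators are differentiable end to end, which is what lets frameworks train whole networks by gradient descent.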

Main Speakers / Sources

Sriram Rajamani (Distinguished Scientist and Managing Director, Microsoft Research India)

Category

Technology