Programming Languages and Machine Learning


This website collects the latest research at the intersection of programming languages and machine learning from the Software Reliability Lab at ETH Zurich. In particular, it covers research on building statistical programming engines -- systems built on top of machine learning models of large codebases. These new kinds of engines can provide statistically likely solutions to problems that are difficult or impossible to solve with traditional techniques.

Statistical Engines

JSNice

JSNice de-obfuscates JavaScript programs. It is a popular system in the JavaScript community, used by tens of thousands of programmers worldwide.

Nice2Predict

An efficient and scalable open-source framework for structured prediction, enabling new statistical engines to be built more quickly.

DeGuard

Based on Nice2Predict, DeGuard reverses the layout obfuscation performed by Android obfuscation systems. It enables security analyses, including code inspection and predicting libraries.

Datasets and Models

150k Python Dataset
Dataset consisting of 150'000 Python ASTs
150k JavaScript Dataset
Dataset consisting of 150'000 JavaScript files and their parsed ASTs
Synthesized programs for probabilistic models (on the above datasets)
JSNice artifact
JSNice artifact that contains an engine, trained model and evaluation dataset
JSNice dataset
List of GitHub repositories used to train JSNice

Talks (Invited, Keynote, etc.)

DeGuard: Statistical Deobfuscation for Android
Android Security Symposium 2017
Programming Languages and Machine Learning
Neural Abstract Machines & Program Induction (NIPS'16 workshop)
Statistical Deobfuscation of Android Applications
CCS 2016 talk
Machine Learning for Programs
CAV'16 Tutorial
Probabilistic Learning from Big Code
ISSTA'16 Keynote Talk
PHOG: Probabilistic Model for Code
ICML 2016 talk
Learning Programs from Noisy Data
POPL 2016 talk
Machine Learning for Programming
Invited Talk at ML4PL'15
Machine Learning for Code Analytics
PLDI'15 Tutorial
Machine Learning for Programming
Invited Talk at MIT ExCAPE'15 Summer School
Machine Learning for Programming
Invited Talk at TCE'15 Conference
Machine Learning for Programming
Talk given by V. Raychev at Columbia University and IBM T.J. Watson Research Center
Programming with Probabilistic Graphical Models
EPFL Colloquium, December 2014
Programming Tools based on Big Data and Conditional Random Fields
Zurich Machine Learning and Data Science Meet-up
Statistical Program Analysis and Synthesis
HVC'14 Keynote
Statistical Program Analysis and Synthesis
ETH Workshop 2014
Code Completion with Statistical Language Models
Talk given at the University of Washington and Microsoft Research (by V. Raychev) and at EPFL and ETH (by M. Vechev)

Publications

PDF Program Synthesis for Character Level Language Modeling
Pavol Bielik, Veselin Raychev, Martin Vechev
ICLR 2017
PDF Learning a Static Analyzer from Data
Pavol Bielik, Veselin Raychev, Martin Vechev
arXiv report 1611.01752
PDF Probabilistic Model for Code with Decision Trees
Veselin Raychev, Pavol Bielik, Martin Vechev
ACM OOPSLA'16
PDF Statistical Deobfuscation of Android Applications
Benjamin Bichsel, Veselin Raychev, Peter Tsankov, Martin Vechev
ACM CCS'16
PDF PHOG: Probabilistic Model for Code
Pavol Bielik, Veselin Raychev, Martin Vechev
ICML'16
PDF Learning Programs from Noisy Data
Veselin Raychev, Pavol Bielik, Martin Vechev, Andreas Krause
ACM POPL'16
PDF Programming with Big Code: Lessons, Techniques and Applications
Pavol Bielik, Veselin Raychev, Martin Vechev
SNAPL'15
PDF Predicting Program Properties from "Big Code"
Veselin Raychev, Martin Vechev, Andreas Krause
ACM POPL'15
PDF Phrase-Based Statistical Translation of Programming Languages
Svetoslav Karaivanov, Veselin Raychev, Martin Vechev
Onward'14
PDF Code Completion with Statistical Language Models
Veselin Raychev, Martin Vechev, Eran Yahav
ACM PLDI'14

Resources



Funded by ERC grant BIGCODE - #680358