Machine Learning for Code

Funded by ERC grant BIGCODE - #680358

Startups

DeepCode

DeepCode offers the first AI-based code review system

Statistical Engines

JSNice

JSNice de-obfuscates JavaScript programs. It is a popular system in the JavaScript community, used by tens of thousands of programmers worldwide.

Nice2Predict

Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.

DeGuard

Based on Nice2Predict, DeGuard reverses the process of layout obfuscation done by Android obfuscation systems. It enables security analyses, including code inspection and predicting libraries.

DEBIN

Based on Nice2Predict, DEBIN recovers debug information (e.g., names and types) of stripped binaries, helpful for various analysis tasks like decompilation, malware inspection and similarity.

Datasets and Models

150k Python Dataset

Dataset consisting of 150'000 Python ASTs

150k JavaScript Dataset

Dataset consisting of 150'000 JavaScript files and their parsed ASTs

Probabilistic models

Synthesized programs for probabilistic models (on the above datasets)

JSNice artifact

JSNice artifact that contains an engine, trained model and evaluation dataset

JSNice dataset

List of GitHub repositories used to train JSNice

Download

Publications

2023

Large Language Models for Code: Security Hardening and Adversarial Testing

Jingxuan He, Martin Vechev

ACM CCS 2023

Distinguished Paper Award

Slides

Paper

Code

2022

On Distribution Shift in Learning-based Bug Detectors

Jingxuan He, Luca Beurer-Kellner, Martin Vechev

ICML 2022

Paper

Code

2021

Learning to Explore Paths for Symbolic Execution

Jingxuan He, Gishor Sivanrupan, Petar Tsankov, Martin Vechev

ACM CCS 2021

Slides

Paper

Code

TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer

Berkay Berabi, Jingxuan He, Veselin Raychev, Martin Vechev

ICML 2021

Slides

Talk

Paper

Code

Learning to Find Naming Issues with Big Code and Small Supervision

Jingxuan He, Cheng-Chun Lee, Veselin Raychev, Martin Vechev

PLDI 2021

Poster

Talk

Paper

Robustness Certification with Generative Models

Matthew Mirman, Alexander Hägele, Timon Gehr, Pavol Bielik, Martin Vechev

PLDI 2021

Paper

2020

Learning Fast and Precise Numerical Analysis

Jingxuan He, Gagandeep Singh, Markus Püschel, Martin Vechev

PLDI 2020

Slides

Talk

Paper

Code

Guiding Program Synthesis by Learning to Generate Examples

Larissa Laich, Pavol Bielik, Martin Vechev

ICLR 2020

Talk

Paper

Adversarial Robustness for Code

Pavol Bielik, Martin Vechev

ICML 2020

Slides

Talk

Paper

2019

Learning to Infer User Interface Attributes from Images

Philippe Schlattner, Pavol Bielik, Martin Vechev

ArXiv 2019

Paper

Learning to Fuzz from Symbolic Execution with Application to Smart Contracts

Jingxuan He, Mislav Balunović, Nodar Ambroladze, Petar Tsankov, Martin Vechev

ACM CCS 2019

Slides

Talk

Paper

Code

Unsupervised Learning of API Aliasing Specifications

Jan Eberhardt, Samuel Steffen, Veselin Raychev, Martin Vechev

PLDI 2019

Paper

Scalable Taint Specification Inference with Big Code

Victor Chibotaru, Benjamin Bichsel, Veselin Raychev, Martin Vechev

PLDI 2019

Slides

Talk

Paper

2018

Robust Relational Layouts Synthesis from Examples for Android

Pavol Bielik, Marc Fischer, Martin Vechev

ACM OOPSLA 2018

Slides

Talk

Paper

DEBIN: Predicting Debug Information in Stripped Binaries

Jingxuan He, Pesho Ivanov, Petar Tsankov, Veselin Raychev, Martin Vechev

ACM CCS 2018

Slides

Talk

Paper

Code

Inferring Crypto API Rules from Code Changes

Rumen Paletov, Petar Tsankov, Veselin Raychev, Martin Vechev

PLDI 2018

Talk

Paper

2017

Learning a Static Analyzer from Data

Pavol Bielik, Veselin Raychev, Martin Vechev

CAV 2017

Slides

Talk

Paper

Program Synthesis for Character Level Language Modeling

Pavol Bielik, Veselin Raychev, Martin Vechev

ICLR 2017

Paper

2016

Probabilistic Model for Code with Decision Trees

Veselin Raychev, Pavol Bielik, Martin Vechev

ACM OOPSLA 2016

Paper

Statistical Deobfuscation of Android Applications

Benjamin Bichsel, Veselin Raychev, Petar Tsankov, Martin Vechev

ACM CCS 2016

Slides

Talk

Paper

2015

Predicting Program Properties from "Big Code"

Veselin Raychev, Martin Vechev, Andreas Krause

ACM POPL 2015

Paper

Programming with Big Code: Lessons, Techniques and Applications

Pavol Bielik, Veselin Raychev, Martin Vechev

SNAPL 2015

Paper

2014

Code Completion with Statistical Language Models

Veselin Raychev, Martin Vechev, Eran Yahav

ACM PLDI 2014

Paper

Phrase-Based Statistical Translation of Programming Languages

Svetoslav Karaivanov, Veselin Raychev, Martin Vechev

Onward! 2014

Paper

Talks

Learning to Analyze Programs at Scale

Machine Learning for Programming Workshop, FLOC 2018

Slides

Learning a Static Analyzer from Data

Computer Aided Verification 2017

Slides

Talk

Probabilistic and Interpretable Models for Code

SYNT workshop, FLOC 2018

Slides

Machine Learning for Programming

iFM 2017 Keynote Talk

Slides

DeGuard: Statistical Deobfuscation for Android

Android Security Symposium 2017

Slides

Talk

Programming Languages and Machine Learning

Neural Abstract Machines & Program Induction (NIPS'16 workshop)

Slides

Statistical Deobfuscation of Android Applications

CCS 2016 talk

Slides

Talk

Machine Learning for Programs

CAV'16 Tutorial

Slides

Probabilistic Learning from Big Code

ISSTA'16 Keynote Talk

Slides

PHOG: Probabilistic Model for Code

ICML 2016 talk

Slides

Learning Programs from Noisy Data

POPL 2016 talk

Slides

Machine Learning for Programming

Invited Talk at ML4PL'15

Machine Learning for Code Analytics

PLDI'15 Tutorial

Slides

Machine Learning for Programming

Invited Talk at MIT ExCAPE'15 Summer School

Slides

Machine Learning for Programming

Invited Talk at TCE'15 Conference

Slides

Talk

Programming with Probabilistic Graphical Models

EPFL Colloquium, December 2014

Talk

Programming Tools based on Big Data and Conditional Random Fields

Zurich Machine Learning and Data Science Meet-up

Statistical Program Analysis and Synthesis

HVC'14 Keynote

Statistical Program Analysis and Synthesis

ETH Workshop 2014

Slides

Talk

Code Completion with Statistical Language Models

Talk given at the University of Washington and Microsoft Research (by Veselin Raychev) and at EPFL and ETH (by Martin Vechev)

Slides

Talk

Resources

A new website for learning from Big Code has been released here.
We are co-organizing a Dagstuhl Seminar on "Programming with Big Code", November 15-18, 2015.