Code Completion

(naturalness and neural-based approaches)

Data Representation

Sequential
Input data = code token sequences
--------------------------------------------
Sequences can be made of tokens, subtokens, characters, or BPE units. A BPE representation of tokens usually yields better results.
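A minimal sketch (not from any specific paper) of building a BPE subtoken vocabulary over code with the HuggingFace tokenizers library; the corpus file name, vocabulary size, and special tokens are placeholder assumptions.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a BPE vocabulary on a hypothetical file of code, one snippet per line.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=10_000, special_tokens=["<unk>"])
tokenizer.train(files=["code_corpus.txt"], trainer=trainer)

# A rare identifier is split into more frequent subunits instead of mapping to <unk>.
print(tokenizer.encode("connection_pool = make_connection_pool(max_size)").tokens)
```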
AST
Input data = parsed ASTs
Path-based
The model predicts a target node given the previous ones
Traversal-based
The model predicts any node by traversing the AST (see the sketch below)
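A minimal sketch of a traversal-based linearization using Python's ast module: the tree is flattened in pre-order into (node type, value) pairs, which is the kind of sequence a traversal-based model is trained to predict left to right. The value-extraction rules and the skipping of Load/Store markers are simplifying assumptions.

```python
import ast

def preorder(node):
    """Flatten an AST into (node_type, value) pairs in pre-order."""
    value = None
    if isinstance(node, ast.Name):
        value = node.id
    elif isinstance(node, ast.Constant):
        value = repr(node.value)
    elif isinstance(node, ast.Attribute):
        value = node.attr
    yield type(node).__name__, value
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.expr_context):  # skip Load/Store markers
            continue
        yield from preorder(child)

for node_type, value in preorder(ast.parse("total = price * quantity")):
    print(node_type, value)
# Module None / Assign None / Name total / BinOp None / Name price / Mult None / Name quantity
```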
Hindle et al. 2012 and Tu et al. 2014
- n-gram language models for code completion
- n-gram language models with a cache component (improves over simple n-gram models; see the sketch after this entry)
----------------------------------------------------------------
- any token completion
- reproducibility: easy for simple n-gram LMs
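A toy sketch of the cache idea: a trigram model over the training corpus is interpolated with counts collected from the file currently being edited. Real cache LMs use smoothing and tuned mixing weights, so this only illustrates the mechanism.

```python
from collections import Counter, defaultdict

class TrigramCacheLM:
    """Toy trigram LM plus a file-local cache, mixed with a fixed weight."""

    def __init__(self, cache_weight=0.5):
        self.corpus = defaultdict(Counter)  # global counts: context -> next token
        self.cache = defaultdict(Counter)   # counts from the file being edited
        self.cache_weight = cache_weight

    def train(self, tokens):
        for i in range(2, len(tokens)):
            self.corpus[tuple(tokens[i - 2:i])][tokens[i]] += 1

    def observe(self, tokens):
        """Feed tokens of the current file into the cache as they are seen."""
        for i in range(2, len(tokens)):
            self.cache[tuple(tokens[i - 2:i])][tokens[i]] += 1

    def prob(self, context, token):
        def p(counts):
            total = sum(counts[context].values())
            return counts[context][token] / total if total else 0.0
        return (1 - self.cache_weight) * p(self.corpus) + self.cache_weight * p(self.cache)
```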
Hellendoorn and Devanbu 2017
- n-gram language models with cache and RNN language models
- comparison between the two LM families for any-token code completion
- static vs. dynamic evaluation: in the dynamic configuration, the model updates its parameters after making predictions on a test file (sketched after this entry)
--------------------------------------------------------------------------------
- any token completion
- reproducibility: easy (artifact available)
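A sketch of the static vs. dynamic evaluation loop described above; model.predict and model.update are placeholder methods standing in for whatever language model is being tested, not a real API.

```python
def evaluate(model, test_files, dynamic=False):
    """Top-1 accuracy over test files; in the dynamic setting the model is
    also updated on each token right after it has been predicted."""
    hits = total = 0
    for tokens in test_files:
        for i in range(2, len(tokens)):
            context, target = tuple(tokens[i - 2:i]), tokens[i]
            hits += model.predict(context) == target  # placeholder prediction call
            total += 1
            if dynamic:
                model.update(context, target)  # fold the test file back into the model
    return hits / total if total else 0.0
```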
Raychev et al. 2014
- comparison between n-gram LMs and RNNs for partial program completion (see the sketch after this entry)
----------------------------------------------------------------------------------
- reproducibility: easy (basic models)
- API completion only
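The completion target here is API call sequences; a rough Python analogue using the standard ast module is sketched below, collecting receiver.method() call sites in source order (the example snippet and helper name are illustrative, not from the paper).

```python
import ast

def api_call_sites(source):
    """Collect (receiver, method) pairs for receiver.method(...) calls,
    sorted by position: a crude stand-in for per-object API sequences."""
    calls = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            calls.append((node.lineno, node.col_offset,
                          ast.unparse(node.func.value), node.func.attr))
    return [(recv, meth) for _, _, recv, meth in sorted(calls)]

print(api_call_sites("f = open('log.txt')\nf.write('hi')\nf.close()"))
# [('f', 'write'), ('f', 'close')]
```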
Semantic
Bhoopchand et al. 2016
- representation of classes with their attributes to learn long-range dependencies
- comparison between n-gram LMs and LSTMs (with and without attention)
- normalization of identifiers to improve results and reduce the vocabulary size (see the sketch after this entry)
----------------------------------------------------------------------------------------------
- identifier completion (class attributes and calls)
- reproducibility: easy (artifact available)
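One possible identifier normalization, sketched with Python's ast module: every distinct variable name is mapped to a numbered placeholder that is consistent within the file. The paper's exact scheme may differ; this only illustrates how normalization shrinks the vocabulary.

```python
import ast

def normalize_identifiers(source):
    """Replace each distinct variable name with a numbered placeholder."""
    mapping = {}

    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):
            mapping.setdefault(node.id, f"var{len(mapping)}")
            node.id = mapping[node.id]
            return node

    tree = Renamer().visit(ast.parse(source))
    return ast.unparse(tree), mapping

code, names = normalize_identifiers("result = compute(width, height) + width")
print(code)  # var0 = var1(var2, var3) + var2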
Li et al. 2017
- comparison between an LSTM and an LSTM with attention
- relates the results to the percentage of OOV tokens (see the sketch after this entry)
----------------------------------------------------------------------
- completion of node types and values
- reproducibility: medium
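A small helper showing what the OOV percentage means here: the fraction of test tokens that never occur in the training vocabulary (the vocabulary and token list below are made up for illustration).

```python
def oov_rate(test_tokens, train_vocab):
    """Share of test tokens that are out of the training vocabulary."""
    if not test_tokens:
        return 0.0
    return sum(token not in train_vocab for token in test_tokens) / len(test_tokens)

vocab = {"def", "(", ")", ":", "self", "return"}
tokens = ["def", "load_config", "(", "self", ")", ":"]
print(oov_rate(tokens, vocab))  # 1/6 ~= 0.17, since 'load_config' is unseen
```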
Karampatsis et al. 2020
- open-vocabulary approach to code completion using subtoken/BPE representations (see the sketch after this entry)
- comparison of n-gram LMs, n-gram LMs with cache, neural LMs, and neural LMs with cache
- provides a detailed data analysis of the tokenization techniques used
-------------------------------------------------------------------------------------
- any token completion
- reproducibility: easy
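In an open-vocabulary setting the model emits subword units, and a full-token completion is obtained by merging them. The sketch below assumes Sennrich-style "@@" continuation markers; the marker convention is an assumption about notation rather than the paper's exact format.

```python
def merge_subtokens(subtokens, marker="@@"):
    """Rebuild full tokens from BPE subunits; a unit ending with the
    continuation marker is glued to the following one."""
    tokens, current = [], ""
    for piece in subtokens:
        if piece.endswith(marker):
            current += piece[: -len(marker)]
        else:
            tokens.append(current + piece)
            current = ""
    return tokens

print(merge_subtokens(["get@@", "Connection@@", "String", "(", ")"]))
# ['getConnectionString', '(', ')']
```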
Svyatkovskiy et al. 2020
- reranking of the output of existing completion tools (a simple one and a more advanced one; see the sketch after this entry)
- extracts the 80 tokens preceding each API call site as context
- comparison between several encoding techniques (full tokens, subtokens, BPE) and language models (LSTM/GRU/Transformer)
----------------------------------------------------------------------------------------------------
- API completion in Python
- shows results on a few unseen APIs (not the main purpose of the work)
- reproducibility: medium
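A sketch of the reranking idea: candidates produced by an existing completion tool are rescored by a language model over the preceding token window (80 tokens per the notes above). lm_score stands in for the trained model and is not a real API; the dummy scorer in the example only shows how the function is called.

```python
def rerank(candidates, preceding_tokens, lm_score, window=80):
    """Order candidate completions by an LM score computed on the
    last `window` tokens before the call site."""
    context = preceding_tokens[-window:]
    return sorted(candidates, key=lambda cand: lm_score(context, cand), reverse=True)

# Example with a dummy scorer that simply prefers shorter candidate names.
ranked = rerank(["read_csv", "read_clipboard", "read"], ["df", "=", "pd", "."],
                lm_score=lambda ctx, cand: -len(cand))
print(ranked)  # ['read', 'read_csv', 'read_clipboard']
```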