Code Completion

(naturalness and neural-based approaches)

Data Representation

Sequential
Input data = code token sequences
--------------------------------------------
Sequences can be made of tokens, subtokens, characters, or BPE units. A BPE representation of tokens usually yields better results.
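A minimal sketch (not from any specific paper) of building a BPE subtoken vocabulary over code with the HuggingFace tokenizers library; the corpus file name, vocabulary size, and special tokens are placeholder assumptions.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a BPE vocabulary on a hypothetical file of code, one snippet per line.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=10_000, special_tokens=["<unk>"])
tokenizer.train(files=["code_corpus.txt"], trainer=trainer)

# A rare identifier is split into more frequent subunits instead of mapping to <unk>.
print(tokenizer.encode("connection_pool = make_connection_pool(max_size)").tokens)
```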
AST
Input data = parsed ASTs
Path-based
The model predicts a target node given the previous ones
Traversal-based
The model predicts any node by traversing the AST (see the sketch below)
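A minimal sketch of a traversal-based linearization using Python's ast module: the tree is flattened in pre-order into (node type, value) pairs, which is the kind of sequence a traversal-based model is trained to predict left to right. The value-extraction rules and the skipping of Load/Store markers are simplifying assumptions.

```python
import ast

def preorder(node):
    """Flatten an AST into (node_type, value) pairs in pre-order."""
    value = None
    if isinstance(node, ast.Name):
        value = node.id
    elif isinstance(node, ast.Constant):
        value = repr(node.value)
    elif isinstance(node, ast.Attribute):
        value = node.attr
    yield type(node).__name__, value
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.expr_context):  # skip Load/Store markers
            continue
        yield from preorder(child)

for node_type, value in preorder(ast.parse("total = price * quantity")):
    print(node_type, value)
# Module None / Assign None / Name total / BinOp None / Name price / Mult None / Name quantity
```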
Hindle et al. 2012 and Tu et al. 2014
- n-gram language models for code completion
- n-gram language models with a cache component (improves over simple n-gram models; see the sketch after this entry)
----------------------------------------------------------------
- any token completion
- reproducibility: easy for simple n-gram LMs
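A toy sketch of the cache idea: a trigram model over the training corpus is interpolated with counts collected from the file currently being edited. Real cache LMs use smoothing and tuned mixing weights, so this only illustrates the mechanism.

```python
from collections import Counter, defaultdict

class TrigramCacheLM:
    """Toy trigram LM plus a file-local cache, mixed with a fixed weight."""

    def __init__(self, cache_weight=0.5):
        self.corpus = defaultdict(Counter)  # global counts: context -> next token
        self.cache = defaultdict(Counter)   # counts from the file being edited
        self.cache_weight = cache_weight

    def train(self, tokens):
        for i in range(2, len(tokens)):
            self.corpus[tuple(tokens[i - 2:i])][tokens[i]] += 1

    def observe(self, tokens):
        """Feed tokens of the current file into the cache as they are seen."""
        for i in range(2, len(tokens)):
            self.cache[tuple(tokens[i - 2:i])][tokens[i]] += 1

    def prob(self, context, token):
        def p(counts):
            total = sum(counts[context].values())
            return counts[context][token] / total if total else 0.0
        return (1 - self.cache_weight) * p(self.corpus) + self.cache_weight * p(self.cache)
```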
Hellendoorn and Devanbu 2017
- n-gram language models with cache and RNN language models
- comparison between the two LM families for any-token code completion
- static vs. dynamic evaluation: in the dynamic configuration, the model updates its parameters after making predictions on a test file (sketched after this entry)
--------------------------------------------------------------------------------
- any token completion
- reproducibility: easy (artifact available)
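A sketch of the static vs. dynamic evaluation loop described above; model.predict and model.update are placeholder methods standing in for whatever language model is being tested, not a real API.

```python
def evaluate(model, test_files, dynamic=False):
    """Top-1 accuracy over test files; in the dynamic setting the model is
    also updated on each token right after it has been predicted."""
    hits = total = 0
    for tokens in test_files:
        for i in range(2, len(tokens)):
            context, target = tuple(tokens[i - 2:i]), tokens[i]
            hits += model.predict(context) == target  # placeholder prediction call
            total += 1
            if dynamic:
                model.update(context, target)  # fold the test file back into the model
    return hits / total if total else 0.0
```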
Raychev et al. 2014
- comparison between n-gram LMs and RNNs for partial program completion (see the sketch after this entry)
----------------------------------------------------------------------------------
- reproducibility: easy (basic models)
- API completion only
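The completion target here is API call sequences; a rough Python analogue using the standard ast module is sketched below, collecting receiver.method() call sites in source order (the example snippet and helper name are illustrative, not from the paper).

```python
import ast

def api_call_sites(source):
    """Collect (receiver, method) pairs for receiver.method(...) calls,
    sorted by position: a crude stand-in for per-object API sequences."""
    calls = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            calls.append((node.lineno, node.col_offset,
                          ast.unparse(node.func.value), node.func.attr))
    return [(recv, meth) for _, _, recv, meth in sorted(calls)]

print(api_call_sites("f = open('log.txt')\nf.write('hi')\nf.close()"))
# [('f', 'write'), ('f', 'close')]
```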
Semantic
Bhoopchand et al. 2016
- representation of classes with their attributes to learn long-range dependencies
- comparison between n-gram LMs and LSTMs (with and without attention)
- normalization of identifiers to improve results and reduce the vocabulary size (see the sketch after this entry)
----------------------------------------------------------------------------------------------
- identifier completion (class attributes and calls)
- reproducibility: easy (artifact available)
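One possible identifier normalization, sketched with Python's ast module: every distinct variable name is mapped to a numbered placeholder that is consistent within the file. The paper's exact scheme may differ; this only illustrates how normalization shrinks the vocabulary.

```python
import ast

def normalize_identifiers(source):
    """Replace each distinct variable name with a numbered placeholder."""
    mapping = {}

    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):
            mapping.setdefault(node.id, f"var{len(mapping)}")
            node.id = mapping[node.id]
            return node

    tree = Renamer().visit(ast.parse(source))
    return ast.unparse(tree), mapping

code, names = normalize_identifiers("result = compute(width, height) + width")
print(code)  # var0 = var1(var2, var3) + var2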
Li et al. 2017
- comparison between an LSTM and an LSTM with attention
- relates the results to the percentage of OOV tokens (see the sketch after this entry)
----------------------------------------------------------------------
- completion of node types and values
- reproducibility: medium
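A small helper showing what the OOV percentage means here: the fraction of test tokens that never occur in the training vocabulary (the vocabulary and token list below are made up for illustration).

```python
def oov_rate(test_tokens, train_vocab):
    """Share of test tokens that are out of the training vocabulary."""
    if not test_tokens:
        return 0.0
    return sum(token not in train_vocab for token in test_tokens) / len(test_tokens)

vocab = {"def", "(", ")", ":", "self", "return"}
tokens = ["def", "load_config", "(", "self", ")", ":"]
print(oov_rate(tokens, vocab))  # 1/6 ~= 0.17, since 'load_config' is unseen
```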
Karampatsis et al. 2020
- open-vocabulary approach to code completion using subtoken/BPE representations (see the sketch after this entry)
- comparison of n-gram LMs, n-gram LMs with cache, neural LMs, and neural LMs with cache
- provides a detailed data analysis of the tokenization techniques used
-------------------------------------------------------------------------------------
- any token completion
- reproducibility: easy
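In an open-vocabulary setting the model emits subword units, and a full-token completion is obtained by merging them. The sketch below assumes Sennrich-style "@@" continuation markers; the marker convention is an assumption about notation rather than the paper's exact format.

```python
def merge_subtokens(subtokens, marker="@@"):
    """Rebuild full tokens from BPE subunits; a unit ending with the
    continuation marker is glued to the following one."""
    tokens, current = [], ""
    for piece in subtokens:
        if piece.endswith(marker):
            current += piece[: -len(marker)]
        else:
            tokens.append(current + piece)
            current = ""
    return tokens

print(merge_subtokens(["get@@", "Connection@@", "String", "(", ")"]))
# ['getConnectionString', '(', ')']
```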
Svyatkovskiy et al. 2020
- reranking of the output of existing completion tools (a simple one and a more advanced one; see the sketch after this entry)
- extracts the 80 tokens preceding each API call site as context
- comparison between several encoding techniques (full tokens, subtokens, BPE) and language models (LSTM/GRU/Transformer)
----------------------------------------------------------------------------------------------------
- API completion in Python
- shows results on a few unseen APIs (not the main purpose of the work)
- reproducibility: medium
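A sketch of the reranking idea: candidates produced by an existing completion tool are rescored by a language model over the preceding token window (80 tokens per the notes above). lm_score stands in for the trained model and is not a real API; the dummy scorer in the example only shows how the function is called.

```python
def rerank(candidates, preceding_tokens, lm_score, window=80):
    """Order candidate completions by an LM score computed on the
    last `window` tokens before the call site."""
    context = preceding_tokens[-window:]
    return sorted(candidates, key=lambda cand: lm_score(context, cand), reverse=True)

# Example with a dummy scorer that simply prefers shorter candidate names.
ranked = rerank(["read_csv", "read_clipboard", "read"], ["df", "=", "pd", "."],
                lm_score=lambda ctx, cand: -len(cand))
print(ranked)  # ['read', 'read_csv', 'read_clipboard']
```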