# thinking-together
m
Our Future of Coding project is taking shape 🤩 We are building a Machine Learning model that can learn how to code 🏆 With @Shubhadeep Roychowdhury we are excited to release codeBERT. It's a Masked Language Model trained over Python source code, thanks to Hugging Face. 🕹️ You can easily load the model and its weights (code below):
```python
# Load the tokenizer and the masked language model from the Hugging Face hub
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("codistai/codeBERT-small-v2")
model = AutoModelWithLMHead.from_pretrained("codistai/codeBERT-small-v2")
```
📈 Examples of the results are in the thread below 📚 A full tutorial on how to load and fine-tune the model for downstream tasks is coming!
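For example, once the checkpoint is loaded you can query it with the `fill-mask` pipeline. A minimal sketch (assuming the `codistai/codeBERT-small-v2` checkpoint is available on the Hugging Face hub and uses the standard RoBERTa-style `<mask>` token):

```python
from transformers import pipeline

# Sketch: ask the masked language model to fill in a masked identifier.
fill_mask = pipeline("fill-mask", model="codistai/codeBERT-small-v2")

snippet = "import numpy as np\narr = np.<mask>([1, 2, 3])"
for prediction in fill_mask(snippet):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each prediction is a dict with the proposed token and its score, so you can see how confident the model is about each completion.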
[Attached image: codeBERT results over Python code]
alltom
Thanks for making it available! It looks like it’s a pure language/BERT model… Is there a clear path toward integrating with external sources of information like the Python runtime? I’m noticing in your example that some likelihood is assigned to sequences that refer to variables that don’t appear to be in scope. Is there a natural language model embedded inside it? Like, if I feed it “# <mask>” how likely am I to get a comment that documents the next line of code?
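For instance, I'm imagining a probe along these lines (just a sketch of what I mean, I haven't run it):

```python
from transformers import pipeline

# Illustrative probe: mask the spot where a comment would go and see what
# single tokens the model proposes there.
fill_mask = pipeline("fill-mask", model="codistai/codeBERT-small-v2")
print(fill_mask("def add(a, b):\n    # <mask>\n    return a + b"))
```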
Shubhadeep Roychowdhury
Hello @alltom, thanks for the comments. At the moment this model is a pure Masked Language Model (a version of RoBERTa) with 10 layers, 12 attention heads per layer, and a max position embedding of 1024, trained on about 6M lines of code. So this is just a start. Context awareness in code means something slightly different than it does for natural language.

We have big plans for the future. We plan to try out mixed-modal models (ones where we mix graph-based and text-based information together) to bring more code-centric awareness into it. We also plan to run some kind of rule engine on top of the predictions to cut some noise (here we can hook into the actual Python runtime/interpreter to gain more insights). And we are planning to train a Longformer- or Reformer-type model to calculate local attention over a large body of code.

At the moment the `<mask>` token stands for a single token only. However, we have plans to fine-tune this model for many downstream tasks. We have started a series of notebooks showing different applications of this model and will release more powerful and hybrid models in the future. You can check out our first notebook (Colab) here: https://colab.research.google.com/drive/17QNGQOsQOUBPlblqc7maiOKPbT4hDZfv?usp=sharing
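To make the rule-engine idea a bit more concrete, here is a rough sketch of the kind of post-filter we have in mind, keeping only predictions that name things visible in a given scope (purely illustrative; `filter_predictions` and `names_in_scope` are invented for this example and are not part of the released code or notebooks):

```python
import builtins
import keyword

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="codistai/codeBERT-small-v2")

def filter_predictions(predictions, names_in_scope):
    # Keep only predicted tokens that are Python keywords, builtins,
    # or names known to be in scope at the masked position.
    allowed = set(names_in_scope) | set(dir(builtins)) | set(keyword.kwlist)
    return [p for p in predictions if p["token_str"].strip() in allowed]

predictions = fill_mask("import numpy as np\narr = <mask>.array([1, 2, 3])")
print(filter_predictions(predictions, names_in_scope={"np"}))
```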