Hey folks! Is anyone working on or interested in ML on...
# thinking-together
m
Hey folks! Is anyone working on or interested in ML on Code? This is a topic I've been looking into with @Shubhadeep Roychowdhury. We believe it's a very promising field for the future of coding. ➡️ ground-breaking function/docstring mismatch detection ➡️ pitiless variable misuse ➡️ dope auto-complete ... Some really nice articles and pieces of code are coming out (you can keep track of them here: https://ml4code.github.io/). If you're interested in discussing it or your own projects, that would be really great! Let me know 🙂
❤️ 5
k
I'm curious to hear more about "pitiless variable misuse".
m
Reminds me of https://openreview.net/group?id=ICML.cc/2018/Workshop/NAMPI#all-papers-under-review. It seems the paper archive website went down 😢 but the videos are still up: https://www.youtube.com/watch?v=au2TG6G6_Pw&list=PLC79LIGCBo81_H_wIBBIOu2GfF3OIixdN. Hope you find it helpful somehow.
m
@Kartik Agaram I can refer you to the CuBERT paper (https://arxiv.org/pdf/2001.00059.pdf) 🎯 Create a model which can predict whether there is a variable misuse in the body of a function 🏆 94% accuracy in the prediction. The model has been trained on Python code specifically, but other papers provide models which are more language-agnostic (though their performance is a bit lower).
👍 1
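For anyone unfamiliar with the term, here is a rough idea of what a "variable misuse" bug looks like. This is a hypothetical toy example (the function names are made up for illustration and it is not taken from the paper): the code is syntactically valid and runs fine, but one in-scope variable is used where another was intended, which is the kind of mistake such a model would try to flag.

```python
def count_long_words_buggy(words, min_length):
    """Count the words that have at least `min_length` characters."""
    count = 0
    for word in words:
        # Variable misuse: `words` (the whole list) is used where `word`
        # (the current item) was intended, so the list length is compared.
        if len(words) >= min_length:
            count += 1
    return count


def count_long_words_fixed(words, min_length):
    """Same function with the intended variable."""
    count = 0
    for word in words:
        if len(word) >= min_length:
            count += 1
    return count


sample = ["a", "tree", "syntax"]
print(count_long_words_buggy(sample, 4))  # prints 0
print(count_long_words_fixed(sample, 4))  # prints 2
```

Both versions type-check and run, which is why a linter rarely catches this and why a learned model looking at names and context might.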
@Mike C. I did not know about this project. But yes, this is a concrete example of ML on code. And given the breakthroughs in NLP over the last few months, performance is skyrocketing. The ML4Code website has some more recent papers on the topic. The GitHub repo of source{d} also has resources (but as the company has shut down, it's no longer maintained).
a
Thanks for mentioning ☝️ Some ex-source{d} people, including me, have recently moved to https://research.jetbrains.org/groups/ml_methods and continue working in this field.
❤️ 1
🍰 1
s
@Alex Bzz Great to know that you worked there. I really liked the work source{d} did.
👍 1
m
Great @Alex Bzz! I was a bit disappointed when I realised your repos were no longer being updated ^^
s
Hey @Alex Bzz, just a quick question out of curiosity: I know that source{d} had collected a huge Public Git Archive, and that you also had a small tool written in Go to explore and download the files. I was wondering if there is any way to still get that archive. It does not seem possible using the tool. Please let me know if you have any ideas.
a
Indeed, several TB of archives are gone from GCS and the company servers by now, but the procedure for collecting the data still works: https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga-create#pga-create
s
Thanks a lot
Well, out of luck. @Alex Bzz It breaks at installation. I guess the versions of the libraries it depends on are no longer available, or something like that. Do you know if a Docker image with everything preinstalled is available somewhere?
Never mind, I am using a pre-built binary and trying to generate the data. Thanks for the pointers.
👍 1