# thinking-together
j
Provenance of content will be a huge challenge due to recent advancements in AI/ML. My immediate thought was "but code is in Git etc. and we know the author", but that's all void if the actual author of the code was a tool like ChatGPT/Copilot and the dev was just the one who pushed it. Maybe AI will be what brings about next-gen versioning systems, where content provenance is managed at the AST node level during code authoring, and not just attributed to whoever pushed the code after the fact.
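Roughly something like this (a toy sketch; every name here is made up to illustrate the idea, not a real tool):
```python
import ast
from dataclasses import dataclass

@dataclass
class Provenance:
    author: str      # e.g. "human:jane" or "tool:copilot"
    timestamp: str   # when the subtree was authored

def annotate(tree: ast.AST, prov: Provenance) -> ast.AST:
    # An editor integration would call this on each freshly authored subtree,
    # so every AST node remembers who (or what) wrote it.
    for node in ast.walk(tree):
        node.provenance = prov
    return tree

snippet = annotate(
    ast.parse("def add(a, b):\n    return a + b"),
    Provenance(author="tool:copilot", timestamp="2022-12-05T12:00:00Z"),
)
print([(type(n).__name__, n.provenance.author) for n in ast.walk(snippet)])
```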
n
Why does provenance matter? If it's good enough to pass a human review, then it's good enough. So quality/accuracy will matter more than provenance. Or did you have copyright liability in mind?
j
I think review is the weak link in that chain. More people with less programming knowledge will be adding code that was written by AI, and even developers are busy and will accept code that looks correct. All that code then feeds into the training set for the next round, and you've got a positive feedback loop. It's the same as self-driving cars: humans learn to trust the system, relax, and then it's all fine until it's not. I do think we'll eventually get to a good place, but right now we're in the gap between two worlds, and that's when the damage happens.
n
Speaking only as a developer: we don't have to accept code that just looks correct, because we have quality-review infra in place. Any code needs to pass the test cases. Can it pass the test cases while being incorrect? Maybe, but that's equally likely with a human author. So generated code is fine. Generated prose is a different matter, though; for that we'll need accuracy-checkers, bullshit-detectors, and systems of that kind.
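To be concrete about that "maybe", here's a toy illustration (invented example): the tests on file pass, the code is still wrong, and the same trap exists for human-written code.
```python
def is_even(n: int) -> bool:
    # Bug: the author only thought about non-negative inputs.
    return n % 2 == 0 if n >= 0 else False

# The review-time test cases all pass, so the code "looks correct"...
assert is_even(4)
assert not is_even(7)

# ...but an untested input exposes the bug:
print(is_even(-2))  # prints False, even though -2 is even
```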
j
During training, how will the AI tell the difference between well-tested code and code that isn't?
n
Not the AI itself, but the AI training team will very likely put quality filters on the training set. Instead of training on all of GitHub or Stack Overflow, you train only on the highly-rated repos or answers.
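Something like this, very roughly (the field names and thresholds are made up; a real pipeline would use whatever metadata it actually has):
```python
def filter_training_repos(repos: list[dict]) -> list[dict]:
    # Keep only repos that clear simple quality proxies before they
    # enter the training corpus.
    return [
        r for r in repos
        if r["stars"] >= 100     # rating/popularity as a crude quality proxy
        and r["ci_passing"]      # the code at least passes its own tests
        and not r["is_fork"]     # skip near-duplicate content
    ]

corpus = filter_training_repos([
    {"name": "a/solid-lib", "stars": 2500, "ci_passing": True,  "is_fork": False},
    {"name": "b/toy-fork",  "stars": 3,    "ci_passing": False, "is_fork": True},
])
print([r["name"] for r in corpus])  # ['a/solid-lib']
```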
j
I see your point, but "highly-rated" is a crude metric that may track popularity/hype more than code quality or correctness. It's a super difficult problem.
n
My point is: when it comes to code, the vicious cycle of bad-quality content -> bad model -> bad-quality content just isn't there, because our quality/rating systems work. For prose, yes, it absolutely is.
j
Interesting. I think code and prose are the same in that regard 😄
n
Not really. Spam on the web is mostly prose: there are both the incentives to produce it and a lack of quality-filter tools for it. Nobody is spamming GitHub with incorrect code.
j
Ah, yep, agreed. They're different in terms of scale and incentives. I was thinking at a more fundamental level: quality assurance of language content at scale (both code and prose are symbolic, and both use and construct abstractions). They're similar problems in that regard, but we might be lucky that the incentives make training on code the less complicated practical problem.
n
Meanwhile, StackOverflow has temporarily banned answers by ChatGPT because they're too inaccurate: https://meta.stackoverflow.com/questions/421831/temporary-policy-chatgpt-is-banned
j
Haha, yep it sure is confident, even while wrong 😄
This actually feeds back to my initial worry. Code alone is not enough to train systems like ChatGPT; it won't learn how to map natural language onto code from code by itself. That comes from training on sources like Stack Overflow. See how this spam makes prose and code not so different after all? The social systems and incentives do appear to overlap. The policy change is needed, but difficult to enforce at scale, just like QA'ing spam/prose, because it takes real effort to know whether something is a good answer.
Humans talking about code -> Social systems/incentives -> Humans gaming the systems with AI
m
AI systems based on large language models are prone to human-like errors, both in solving logic puzzles and in writing program code. That gives me less confidence in code review as a quality check on AI-written code: both the AI and the human are okay with code that "looks good". The advances in the scientific method came from developing techniques to work around the limitations and drawbacks of human reasoning; we'll need to apply these (and new ones) to AI systems. In my view, the test suite is going to be the most important part of the code base, since an AI will write the code. (Until the AIs can write the test suite as well 🙂)
w
Things to keep in mind:
1. Part of what makes ChatGPT an improvement over plain GPT-3 comes from training for relevance rather than for the first thing you would think of. So adjusting the training criteria should then feed into...
2. If an AI is writing code, it should iterate with itself, running the code, before coming back with an answer. Basically, mix in some of what's done to play games. In fact...
3. We could see quick improvement in these chat systems if well-engineered prompts can go back into the training of the system. For example, one reason why "in the style of" prompts get more pleasant output from ChatGPT is that they steer the system away from its default middling-BS mode.
Curiously, one good use of ChatGPT is to help discover misconceptions that students learning technical subjects are likely to have. In conversations so far, I've seen ChatGPT be fuzzy on the distinctions between (1) continuity and uncountability in math, and (2) blocks and procs in Ruby. To the credit of incorporating so much training data, ChatGPT is conversant on a great many topics. It's kind of interesting how OpenAI has put a filter on this, so as to avoid answering questions where ChatGPT's knowledge might be limited.