One of these comments was generated by a computer. Can you tell which one?
Comments — or Lack Thereof
Commenting code is boring. It’s tedious, time-consuming, and almost universally disliked. Why should we spend our time writing comments when we could be doing something so much more interesting? Like, say, designing a model to write our comments for us.
At this point, it’s well established that programmers regularly fail to comment their code, or comment it so poorly that the result is effectively indecipherable. But beyond being a courtesy to other developers (and to oneself, 6 months later), effective comments play a crucial role in the software engineering process: they drive down development time, enable more effective bug detection, and facilitate better code reuse. As much as we would like to avoid writing them, source code comments are incredibly important.
But, like all good programmers, we know that the best way to solve any specific problem is to first solve a more general problem. So, instead of accepting the inevitability of having to comment our own code, we set out to design an algorithm that would take the code we’ve written and generate appropriate comments. And we think we’re on to something. The first comment in the example above was generated by our system. The second was written by a human. Did you choose correctly?
Data — the Main Ingredient
Unfortunately, in order to train a model to write source code comments, you need both source code and comments. Fortunately, a number of developers have been kind enough to provide us with their code, comments included, in the form of open-source repositories. We use Doxygen to identify code/comment pairs in a large number of repos containing Java, C++, and Python code. From each extracted comment we keep only the first sentence, as it is usually the one that best describes the associated code. We then filter out pairs where the comment is too long (more than 50 words), too short (fewer than 3 words), or contains phrases like “thanks to” or “created by”, since these are typically uninformative. Similarly, we drop pairs where the code is very long (more than 4096 characters), since such code blocks usually cannot be summarized well in a single sentence. After filtering, we retain about 85% of the code/comment pairs we initially collected.
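To make the filtering concrete, here is a rough Python sketch of those heuristics. The thresholds (3 words, 50 words, 4096 characters) and the banned phrases come straight from the description above; the helper names and the naive regex-based sentence splitter are illustrative, not our exact pipeline.

```python
import re

MIN_WORDS = 3          # comments shorter than this are usually uninformative
MAX_WORDS = 50         # comments longer than this rarely summarize cleanly
MAX_CODE_CHARS = 4096  # very long code resists a one-sentence summary
BANNED_PHRASES = ("thanks to", "created by")  # typical boilerplate text

def first_sentence(comment: str) -> str:
    """Keep only the first sentence; it usually best describes the code.

    Naive splitter: assumes sentences end with '.', '!', or '?'.
    """
    match = re.search(r"[.!?](\s|$)", comment)
    return comment[:match.end()].strip() if match else comment.strip()

def keep_pair(code: str, comment: str) -> bool:
    """Apply the length and phrase filters to one code/comment pair."""
    sentence = first_sentence(comment)
    n_words = len(sentence.split())
    if not (MIN_WORDS <= n_words <= MAX_WORDS):
        return False
    if any(phrase in sentence.lower() for phrase in BANNED_PHRASES):
        return False
    if len(code) > MAX_CODE_CHARS:
        return False
    return True

# Toy example: the first pair survives, the second is dropped
# because its comment contains a banned attribution phrase.
pairs = [
    ("def add(a, b):\n    return a + b", "Adds two numbers. Returns their sum."),
    ("def f(x): ...", "Created by J. Doe."),
]
kept = [(code, first_sentence(comment))
        for code, comment in pairs if keep_pair(code, comment)]
```

Everything here is a simplification; in particular, splitting real comments into sentences takes more care than a single regex, and the actual extraction of code/comment pairs is handled by Doxygen rather than by hand.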