Visualizing Programming Behaviors with Stack Overflow

Take a look at the interactive visualization of the Stack Overflow tag data and discover some insights for yourself!

The one constant among all programmers is the use of Stack Overflow. Stack Overflow serves as crowdsourced documentation for many programming communities. Posts are tagged with the appropriate technology, subject, and/or language. Users select those tags from a drop-down menu, with the option to add a tag in the rare case that there is no existing or appropriate one. Some of the most interesting tags are general and found in many languages or sets of tools.

These more general tags reveal two types of insights. First, general tags that are correlated with a language or toolset represent common tasks within that toolset (e.g., csv is grouped with Python, but CSV files are not specific to the Python language). Second, the same general tags can point to tasks that are inherently challenging within a toolset. Distinct clusters of tags may also share general tags, which points to underlying commonalities.

Many popular Stack Overflow posts revolve around tasks that are both uncommon and challenging. These tasks send people to Stack Overflow because they come up rarely, so more assistance is needed to solve them when they reappear. Consider your own set of rarely used bash commands that you have to look up on Stack Overflow every time they come up.

Stack Overflow periodically releases a dataset that includes the text, code, and tags for every post. To see the relationships between tags, we used this dataset to generate a co-occurrence matrix of every tag across posts. The network visualization depicts the co-occurrence matrix of the top 416 tags.
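As a rough illustration of the co-occurrence step, here is a minimal Python sketch. It assumes each post has already been reduced to a list of its tags; the `posts` sample and the `tag_cooccurrence` helper are hypothetical and are not taken from the original analysis, which worked on the full Stack Overflow dataset.

```python
from collections import Counter
from itertools import combinations


def tag_cooccurrence(posts):
    """Count how often each unordered pair of tags appears on the same post."""
    pair_counts = Counter()
    for tags in posts:
        # Sort and de-duplicate so ("csv", "python") and ("python", "csv")
        # are counted as the same pair.
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical posts, each reduced to its list of tags.
posts = [
    ["python", "csv", "pandas"],
    ["python", "csv"],
    ["c", "pointers", "string"],
    ["java", "string"],
]

counts = tag_cooccurrence(posts)
print(counts[("csv", "python")])  # prints 2
```

Storing only the pairs that actually occur keeps the matrix sparse, which matters when most tags never appear together.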

The clusters are calculated using the Louvain community detection algorithm, and they form the foundation of my analysis. Every link is considered when determining the clusters but, for clarity, the weaker links are hidden in the final visualization. The clusters, each outlined by a lighter-colored path, closely mirror real-world programming toolsets. The cluster labels were assigned by hand, so some of the smaller groups are left unnamed.
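The Louvain algorithm works by greedily increasing a score called modularity, which rewards partitions whose links fall mostly inside clusters rather than between them. The sketch below is not Louvain itself and is not the code behind the visualization; it only computes weighted modularity for a given partition and shows one way weak links might be hidden for display. The edge weights, the cluster labels, and the `min_weight` threshold are all made-up assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Weighted modularity of a partition of an undirected graph.

    edges: dict mapping (tag_a, tag_b) -> co-occurrence weight
    community_of: dict mapping tag -> community label
    """
    total = sum(edges.values())    # total edge weight in the graph
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree per community
    for (a, b), w in edges.items():
        degree[community_of[a]] += w
        degree[community_of[b]] += w
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += w
    return sum(
        internal[c] / total - (degree[c] / (2 * total)) ** 2
        for c in degree
    )


def strong_links(edges, min_weight):
    """Keep only links heavy enough to draw; scoring still uses all edges."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}


# Hypothetical co-occurrence weights between tags.
edges = {
    ("csv", "python"): 5,
    ("pandas", "python"): 4,
    ("c", "pointers"): 6,
    ("c", "string"): 3,
    ("python", "string"): 1,
}

two_clusters = {
    "python": "py", "csv": "py", "pandas": "py",
    "c": "c", "pointers": "c", "string": "c",
}
one_cluster = {tag: "all" for tag in two_clusters}

q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
drawn = strong_links(edges, min_weight=3)
print(q_two > q_one, len(drawn))
```

Here the two-cluster partition scores higher than lumping every tag together, which is the kind of improvement Louvain searches for. In practice a library implementation, such as `louvain_communities` in recent NetworkX releases, would be used instead of hand-rolled code.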

C / Programming and Java

Take a look at this large central cluster that includes C:

Nearly every tag in this cluster is related to basic programming paradigms. This is our introductory cluster because, as one would expect, some of these basic paradigms, like strings, are relatively difficult in C. Others, like pointers, are a main feature of C programming but not of some other common languages like Java or Python.

We might expect this cluster to align closely with C++, but it is directly tied to the Java ecosystem:

The union of these two clusters might reflect the common use of C and Java in introductory classes and, more generally, their use in all types of software and industry.

JavaScript and OOP

Next let’s take a look at the JavaScript and web cluster that aligns with a much smaller blue cluster:

The yellow cluster is the naturally formed group of web-based technology: JavaScript, HTML, jQuery, etc. But take note of the smaller blue nodes that are splitting the larger cluster:

As you can see, the smaller nodes are OOP (object-oriented programming) and other related software-design tags. Now I won’t go so far as to say that object-oriented programming is impossible in JavaScript, but…classes were only added to standard JavaScript with the release of ECMAScript 2015. For example, look at what happens if you type “javascript is everything…” into Google:

Let’s face it. Object-oriented design patterns have not historically been well supported in JavaScript, so it’s natural that this topic would generate a lot of discussion.

Python and C++

Take a look at the Python cluster:

The Python cluster is the best example of “what is common?” Notice the tags: csv, plot, list, dictionary. These tags are not specific to Python, yet taken together with pandas, numpy, data-frame, and matplotlib, they make it clear that Python is commonly used for data science.

What interests me further is the Python cluster’s proximity to the C++ cluster:

C++ is a language known for graphics and optimized programming. This union could be due to the types of common tasks in each language (e.g., image processing is popular in both languages). After a quick Stack Overflow search, I discovered C++ is used to achieve performance boosts in Python code:

C# / .Net and Events

C# and .Net have long been the de facto ecosystem for Windows-based software development.

The C# cluster is also close to the button and event handling cluster, which might mean that managing events and UI interactions in C# is a task that generates a lot of questions:

To see why there might be a lot of questions about event handling in C#, we can look at the Stack Overflow posts. The second-highest voted post about C# and events is a basic question about how events work in C#:

Node.js

The Node.js cluster sits noticeably apart from the others. The distance is interesting because this cluster is home to concurrency, multithreading, sockets, and some other advanced programming concepts:

This may reflect the use of Node.js and Express for real-time, socket-based applications. It could also hint at what is involved in adapting JavaScript as a backend language. The features that make Node.js distinct from other environments, namely asynchronous event handling and concurrency, also seem to be its most talked-about aspects.

PHP

Evidently, dealing with Unicode characters is a common part of programming successful PHP applications:

Conclusion

Stack Overflow is a valuable window into the questions and problems that developers face every day. In this blog post we examined these by analyzing the co-occurrence of tags on Stack Overflow posts. Determining the common themes in developers’ tasks and challenges can lead to better technological designs in the future. These advances may come in the form of new frameworks, better documentation, or changes to APIs. We hope that analyses like ours will move the conversation forward and contribute to a deeper understanding of the way we use software development technology.