As data science becomes critical to every organization, choosing the right tools for it has become just as important. The two most popular languages for tackling data science problems are Python and R. Both are open source and backed by large communities. But Python and R also bring their own unique strengths to data science, which makes it harder to decide which to use.
R vs. Python: the main differences
R is an open-source, interactive environment for doing statistical analysis. It’s not really a programming language, but it includes a programming language to help with the analysis.
As outlined on the site of the R project: “R is an integrated suite of software facilities for data manipulation, calculation and graphing [which] includes… a large, coherent, integrated collection of intermediate data analysis tools….” While not the first such tool, R was early in data science and has been a staple of academia for quite some time now.
Python, on the other hand, is an open-source, “high-level interpreted, object-oriented programming language with dynamic semantics,” according to the project website. However, this doesn’t really do it justice. Python is an easy-to-learn, general-purpose language that is often the first language a developer learns, as it has long been a teaching language.
“It’s easy to use, easy to pick up, kids use it, non-programmers pick it up in a weekend,” Anaconda CEO Peter Wang once said. “This is no coincidence [but rather] has been a hardcore part of the design from the start and very intentional.”
Relatedly, Python has always been great as a glue language. As RedMonk analyst Rachel Stephens has emphasized, “In that sense, it makes a lot of sense for enterprises to invest in Python as a way to invest in their established code.” In other words, Python helps enterprises make legacy code part of their more recent data science ambitions.
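To make the “glue language” idea concrete, here is a minimal sketch of Python wrapping an existing program and turning its text output into data that analysis code can use. Everything in it is an illustrative assumption rather than anything from the sources quoted here: the `LEGACY_COMMAND` stand-in (a one-line Python script playing the role of an older tool), the comma-separated output format, and the helper names `run_legacy_report` and `total_revenue`.

```python
import subprocess
import sys

# Hypothetical stand-in for a legacy program: a tiny script that prints
# comma-separated records. In practice this would be an existing binary.
LEGACY_COMMAND = [
    sys.executable,
    "-c",
    "print('widget,4,2.50'); print('gadget,1,10.00')",
]


def run_legacy_report(command):
    """Run the legacy command and parse its CSV-like output into dicts."""
    completed = subprocess.run(
        command, capture_output=True, text=True, check=True
    )
    rows = []
    for line in completed.stdout.splitlines():
        name, quantity, unit_price = line.split(",")
        rows.append(
            {
                "name": name,
                "quantity": int(quantity),
                "unit_price": float(unit_price),
            }
        )
    return rows


def total_revenue(rows):
    """Sum quantity * unit_price across all parsed rows."""
    return sum(row["quantity"] * row["unit_price"] for row in rows)
```

Calling `total_revenue(run_legacy_report(LEGACY_COMMAND))` runs the stand-in program and aggregates its output. The older tool stays untouched while new analysis code is layered on top of it.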
This is perhaps where the main benefit of Python for data science stands out: everyone knows it.
“Python is the second best language for everything,” said Van Lindberg, general counsel for the Python Software Foundation. “R may be best for statistics, but Python is second… and second best for ML, web services, shell tools, and (insert use case here).”
Lindberg may be underselling Python’s strength in some areas, where it is clearly better than second best, but his point is directionally correct: “If you want to do more than just statistics, Python’s breadth is an overwhelming win.”
In other words, Python is good enough that developers and others choose it for a wide variety of use cases. Python, like Java, is a general-purpose programming language; unlike Java, however, it is quite easy to learn and use. As such, it is used for all sorts of things, leading to the “explosive growth” Wang once described. No wonder, then, that if we compare the relative growth and decline of Python and R jobs for data scientists from 2019 to 2021, as Terence Shin has done, it is clear that Python is winning at the expense of R.
R vs. Python: which is better for data science?
While Python has proven to be more popular than R, that doesn’t mean it’s always better. As with most things in technology, it depends on what you hope to achieve. While Python has a lower bar to learn and become productive, and R’s off-the-shelf approach can be cumbersome to learn, for some tasks it pays to invest in learning R. And of course for some things, like data mining and data visualization, you will probably do well with either.
However, what you choose should stem from the problem you’re trying to address and the long-term investments you and your business plan to make.
For example, R is better suited for statistical computation and data visualization because R is purpose-built by statisticians for statistical and numerical analysis of large data sets. You don’t need to write a lot of code in R to drive in-depth statistical analysis and data visualization.
Also, for some areas, such as life sciences, the R packages can be particularly well developed, making R a good choice. A lot depends on what you are building and your background. As Align BI partner Ryan Hobson said in an interview, “I think R is an easier language for statisticians who may not have a programming background.”
But it’s exactly that “programming background” that makes Python the clear winner for developers or others interested in big data, artificial intelligence (AI), and deep learning algorithms.
“Python had a broader scope [than R] from the beginning [with engineering and science] DNA baked into the Python core,” Wang said. It’s objectively true that Python is dramatically more popular than R, across a much wider range of usage scenarios, and becoming more so every day.
Then there is the reality that the nature of data science is changing.
“There has also been an expansion beyond what has traditionally been a pure data science team; at Netflix, for example, we have the role of Algorithms Product Manager,” noted Christine Doig, director of innovation for personalized experiences at Netflix. “There is more integration with the design team, with creative teams.”
That expansion of specialization in data science calls for a wider variety of people to help with the data science load, which in turn favors a language like Python that is more widely used.
Therefore, there is a very real question of whether it is worth investing in R to solve a relatively limited number of use cases versus Python, which allows an organization to meet a wide variety of use cases. The answer may be yes, but you have to think carefully.
Or maybe you should just wait. After all, the R and Python communities are both actively improving their respective capabilities by adding packages and libraries that deepen and extend their usability. In this area, however, the advantage goes to Python, both because of the relative size of its community and because of its glue code pedigree.
According to Wang, it’s entirely possible that instead of replacing R for some use cases, “maybe someone will build a nice Python wrapper to expose a thin shim to expose some R capabilities.” In other words, it’s not hard to imagine Python embracing those native elements of R so developers and data scientists don’t have to choose.
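As a rough illustration of what such a “thin shim” might look like, the sketch below shells out to R’s `Rscript` command-line front end from Python. It is an assumption-laden example and not the wrapper Wang has in mind: it presumes `Rscript` is installed and on the PATH, and the helper names `build_r_command` and `r_eval` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def build_r_command(expression: str) -> List[str]:
    """Build an Rscript call that prints the value of one R expression."""
    # `Rscript -e` evaluates the given expression; cat() writes it to stdout.
    return ["Rscript", "-e", f"cat({expression})"]


def r_eval(expression: str) -> Optional[str]:
    """Evaluate an R expression and return its printed value as text.

    Returns None when Rscript is not available (assumption: R is optional).
    """
    if shutil.which("Rscript") is None:
        return None
    completed = subprocess.run(
        build_r_command(expression),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()
```

On a machine with R installed, `r_eval("median(c(3, 1, 4, 1, 5))")` would hand the computation to R and return its printed result as a string. Without R, it returns `None`, so the calling Python code can fall back to another path.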
Both R and Python serve their respective constituencies well. Yes, the Python community is much bigger and pulls R packages into the Python ecosystem rather than vice versa, but which one you use may ultimately be a matter of both, not either/or.
Disclosure: I work for MongoDB, but the views expressed herein are my own.