Rust for Data Engineering—what's the hype about? 🦀
One of the most hyped areas of debate in the data infrastructure space today is the programming language Rust. Some are convinced the language will become the de facto data engineering standard, while many are skeptical of the hype. In this article, we aim to give a brief intro to Rust, its hype, when Rust is a great option for data engineering and when it isn’t. Lastly, we outline the main obstacle to Rust becoming a more widely adopted language.
What is Rust?
Let’s start with the basics. Rust is a “blazingly fast and memory efficient” programming language. It is highly reliable, with a rich type system and an ownership model that together guarantee memory safety and thread safety. This sits in stark contrast to the more classical languages C and C++, which often come with a plethora of security concerns and vulnerabilities. Rust was first introduced in 2010 at Mozilla Research, and has been gaining popularity ever since.
The Rust hype
Before diving into the Rust discussion, it’s helpful to take a step back and look at programming language fads and predictions over the years. One example is when Julia was dubbed the “one language to rule them all” in this Wired article from 2014. In the article, the language creators and the author of the article discuss how Julia would be able to replace Matlab, R, Ruby and Python (and in some instances, even Hadoop!) for advanced mathematics problems—while also running at the same speed as Java or C. However, reality didn’t really turn out that way. Julia—despite high popularity in Stack Overflow’s developer survey (see image below)—hasn’t replaced other languages to the extent predicted in the article. The same scenario might play out with Rust for data engineering. To put it bluntly, there’s simply no way to know whether Rust will become the de facto standard for data engineering. This is largely because the language chosen to implement data solutions depends on myriad factors: What language is the data team fluent in? What is the business context in which this code will run? What’s the ecosystem of languages already used in the data stack? Nevertheless, in this article we hope to shed some light on some of the factors that influence whether Rust is a suitable language given a specific data engineering situation.
Now back to the Rust hype. The language has quickly gained popularity among programmers across the world. In 2022, it marked its seventh(!) consecutive year as the most loved language in Stack Overflow’s developer survey, with 87% of developers saying they want to continue using it.
When looking under the hood, it’s easy to understand why. Rust is often described as having excellent documentation, being explicit with no surprises, being super fast and performant, featuring a user-friendly compiler, and of course (as mentioned), avoiding the memory errors of C and C++. Ilson Balliego, Senior Software Engineer says “the compiler deals with most of the common mistakes, so I can focus on my task instead.” In the same vein, Ji Krochmal, Data Engineering Lead adds “Rust is fantastic for debugging. The language makes it easy to write very debuggable code with its explicit error handling.”
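To make the point about explicit error handling concrete, here is a minimal sketch. The `Reading` record, the `ParseError` enum and the "sensor_id,reading" line format are hypothetical and chosen only for illustration. The idea is that every way the parse can fail appears in the function's return type, so the caller has to deal with each case.

```rust
use std::fmt;
use std::num::ParseFloatError;

/// Hypothetical record for one line of "sensor_id,reading" input.
#[derive(Debug, PartialEq)]
struct Reading {
    sensor_id: String,
    value: f64,
}

/// Every way parsing can fail is spelled out as a variant.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingField(&'static str),
    BadValue(ParseFloatError),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::MissingField(name) => write!(f, "missing field `{}`", name),
            ParseError::BadValue(err) => write!(f, "invalid reading: {}", err),
        }
    }
}

/// Parses one line; the `?` operator returns early with the error.
fn parse_reading(line: &str) -> Result<Reading, ParseError> {
    let mut parts = line.split(',');
    let sensor_id = parts
        .next()
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .ok_or(ParseError::MissingField("sensor_id"))?;
    let raw_value = parts.next().ok_or(ParseError::MissingField("value"))?;
    let value = raw_value
        .trim()
        .parse::<f64>()
        .map_err(ParseError::BadValue)?;
    Ok(Reading {
        sensor_id: sensor_id.to_string(),
        value,
    })
}

fn main() {
    for line in ["s1,20.5", "s2,abc", "s3"] {
        // The caller must handle both outcomes before using the data.
        match parse_reading(line) {
            Ok(reading) => println!("ok: {:?}", reading),
            Err(err) => println!("error in {:?}: {}", line, err),
        }
    }
}
```

Because the failure cases are ordinary values, a bad record shows up as a named error at the point where it was parsed rather than as a surprise further down the pipeline.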
It’s not only individual programmers who express their love of the language. Multiple data infrastructure companies are publicly announcing their Rust usage. InfluxData, Materialize, Microsoft (Azure), and AWS are some examples of companies that are betting on Rust going forward. Shane Miller, Senior Software Engineering Manager at AWS, says, “Rust helps us deliver fast, robust services to AWS customers at Amazon scale,” and Microsoft Azure CTO Mark Russinovich hails “most loved” Rust as the successor to C and C++.
Now that we’ve introduced Rust as a language, let’s look at Rust specifically used for data engineering. In 2022, this hype really picked up speed. We started seeing data engineers rave all over the internet about how Rust is taking over and becoming the favorite language of data engineers.
However, it’s easy to get dragged into the explosive growth of new technologies. The truth of the matter is that no language will be a panacea that can cure all broken data pipelines overnight, and Rust might not always be the best language choice. Data engineering is such a broad field, ranging from pure software engineering to BI- and analytics-type tasks, all the way to building pipelines that feed into machine learning workflows—the specific challenges you face in your day-to-day work should define what language and tools are best suited.
With this, let’s take a closer look at when Rust is a good choice, and when it might be a better idea to stay with the tried-and-true friend Python.
When to use Rust for data engineering
As mentioned, some circumstances make Rust an excellent choice for data engineering. These circumstances include: when prioritizing speed, when working with Apache Arrow, when data pipelines have extra security requirements beyond the standard ones, and when rigid data typing is beneficial to your project.
First, if the problems you’re solving depend heavily on processing large sets of data quickly, then Rust might be your tool of choice. The language is so performant that there’s seldom a need to spend time manually optimizing bottlenecks.
Second, Rust has excellent support for the Apache Arrow data format and library to process data in-memory. This means that if you’re working with this type of columnar data, Rust and Apache Arrow can become a match made in heaven to transport and store data. However, Apache Arrow has language bindings in almost every language under the sun (feel free to check out the Arrow Libraries section at the bottom here), so it’s not necessarily the case that Rust has a substantial advantage over other languages—it will depend on your specific situation. In other words, Apache Arrow plays very nicely with Rust, but if there aren’t other reasons to use Rust over other languages, this compatibility in and of itself might not be a strong argument to use Rust specifically. Let’s look a bit deeper into what some of those other reasons to use Rust might be.
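For readers unfamiliar with the term, the sketch below shows what “columnar” means in practice. It uses plain standard-library vectors and is not the Apache Arrow API; the `TripRow` and `TripColumns` types are hypothetical. The point is only that a column-oriented layout stores each field contiguously, so an aggregation over one field reads just that field’s memory.

```rust
/// Row-oriented layout: one struct per record.
struct TripRow {
    distance_km: f64,
    fare: f64,
}

/// Column-oriented layout: one contiguous vector per field.
struct TripColumns {
    distance_km: Vec<f64>,
    fare: Vec<f64>,
}

impl TripColumns {
    /// Transposes a slice of rows into per-field columns.
    fn from_rows(rows: &[TripRow]) -> Self {
        TripColumns {
            distance_km: rows.iter().map(|r| r.distance_km).collect(),
            fare: rows.iter().map(|r| r.fare).collect(),
        }
    }

    /// Summing fares only walks the `fare` column.
    fn total_fare(&self) -> f64 {
        self.fare.iter().sum()
    }
}

fn main() {
    let rows = vec![
        TripRow { distance_km: 1.5, fare: 7.0 },
        TripRow { distance_km: 4.0, fare: 10.5 },
    ];
    let columns = TripColumns::from_rows(&rows);
    println!(
        "{} trips, total fare {}",
        columns.distance_km.len(),
        columns.total_fare()
    );
}
```

A real Arrow implementation adds much more on top of this idea (a standardized memory format, null handling, interoperability across languages), so treat this purely as an illustration of the layout.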
Third, Rust’s compiler is very thorough and strict, which lets it catch many classes of memory-safety and thread-safety bugs at compile time rather than at runtime (e.g. in production). Jonathan Bridger, Software Engineer, says:
“While this can mean a more difficult and lengthy process of reaching the point where your project can compile, this leads to a more reliable and robust application at runtime.”
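As a small illustration of what the compiler enforces around threads, consider the sketch below. The function name `count_records` and the one-thread-per-chunk design are assumptions made for the example. Handing each spawned thread a plain `&mut usize` pointing at a shared counter is rejected at compile time, so the shared state has to be wrapped in something like `Arc<Mutex<_>>`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Counts records across chunks, using one thread per chunk.
fn count_records(chunks: Vec<Vec<u64>>) -> usize {
    // Arc gives shared ownership across threads; Mutex guards mutation.
    let total = Arc::new(Mutex::new(0usize));

    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let total = Arc::clone(&total);
            // Each thread takes ownership of its chunk and its Arc handle.
            thread::spawn(move || {
                let mut guard = total.lock().unwrap();
                *guard += chunk.len();
            })
        })
        .collect();

    // Wait for every worker before reading the result.
    for handle in handles {
        handle.join().unwrap();
    }

    let result = *total.lock().unwrap();
    result
}

fn main() {
    let chunks = vec![vec![1, 2, 3], vec![4], vec![]];
    println!("records: {}", count_records(chunks));
}
```

For a simple counter an atomic integer would be lighter than a mutex; the mutex is used here only because it generalizes to any shared value.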
Naturally, this leads to a related benefit: security.
When deploying data applications, security is of utmost concern for most companies, and even more so when dealing with PII (personally identifiable information). Knowing that the language used to run those applications is memory-safe might give extra peace of mind to the teams responsible for the data. For example, Google has seen a drop in memory safety vulnerabilities in their Android ecosystem when implementing more memory-safe code, e.g. Rust. On December 1, 2022, Google said “In Android 12 we announced support for the Rust programming language in the Android platform as a memory-safe alternative to C/C++. Since then we’ve been scaling up our Rust experience and usage within the Android Open Source Project (AOSP).”
Lastly, Rust has a very rigid type system, as opposed to Python, which is a more flexibly typed language (see more details later in this text). This means that Rust might be very well suited to data engineering tasks where it’s important to enforce data types. Ji Krochmal, Data Engineering Lead, says:
“When working to enforce data contracts and schemas it might be extremely beneficial to verify types at compile time, because it can help eliminate some of the most common bugs in the modern data stack”
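One way to picture a data contract expressed in types is the sketch below. The `Order` record, its fields and the `Currency` enum are hypothetical. Once a record has this type, downstream code cannot pass a string where an amount is expected, cannot forget that `coupon` may be absent, and cannot use a currency outside the listed variants without the compiler complaining.

```rust
/// Hypothetical set of currencies the contract allows.
#[derive(Debug, PartialEq)]
enum Currency {
    Usd,
    Eur,
}

/// Hypothetical contract for one order record.
#[derive(Debug, PartialEq)]
struct Order {
    order_id: u64,
    /// Unsigned, so a negative amount cannot be represented.
    amount_cents: u64,
    currency: Currency,
    /// Nullable fields are explicit: callers must handle `None`.
    coupon: Option<String>,
}

/// Sums the amounts of all orders in the given currency.
fn total_cents(orders: &[Order], currency: &Currency) -> u64 {
    orders
        .iter()
        .filter(|order| &order.currency == currency)
        .map(|order| order.amount_cents)
        .sum()
}

fn main() {
    let orders = vec![
        Order { order_id: 1, amount_cents: 1250, currency: Currency::Usd, coupon: None },
        Order { order_id: 2, amount_cents: 900, currency: Currency::Eur, coupon: Some("SPRING".to_string()) },
        Order { order_id: 3, amount_cents: 300, currency: Currency::Usd, coupon: None },
    ];
    println!("USD total: {} cents", total_cents(&orders, &Currency::Usd));
}
```

Note that the compile-time guarantee covers code that consumes `Order` values. Raw input arriving from outside the program still has to be validated at runtime when it is converted into this type.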
In summary, Rust might be the best choice for data engineering if speed, performance and reliability are your top priorities, if you’re working with Apache Arrow, if safety and security are highly prioritized, and if you want rigidly defined data types.
Now, let’s make the picture more nuanced and take a look at when Rust might actually not be the best choice for data engineering.
When not to use Rust for data engineering
As with almost everything, Rust comes with pros and cons. It turns out that the language might not be the best choice for your data engineering efforts in all instances. Examples include when the rest of your data ecosystem is in Python, when you want to prototype something quickly, when your team doesn’t already know Rust, and when you might have a hard time finding talent that’s fluent in Rust (and has experience working with the language).
Let’s look at the rest of the data ecosystem first. Some people state that Python is an “objectively better language” than Rust for certain problems, and we tend to agree. All in all, if you’re working on data engineering tasks that need to integrate with e.g. data science projects that are already written in Python, then Python might be the better choice. The same thing goes for other languages, like Scala and of course, SQL.
In addition, Rust’s type safety might actually be too rigid in some data engineering instances. For example, let’s say you’re processing data that comes from a manually typed source where sometimes you get integer types and sometimes string types. Python is flexible enough to give a data scientist the freedom needed to explore such data. This flexibility can be taken advantage of in e.g. Jupyter notebooks (a widely adopted tool used to explore data). Coupled with the libraries pandas and Matplotlib, Jupyter notebooks make it possible to extract the needed information from a dataset in just a few minutes. Rust, on the other hand, is very strict when it comes to its type system, as previously mentioned. This means the shape of your data must be known beforehand, and a simple task like reading a CSV file can be very cumbersome without the appropriate libraries. Rust also lacks exploratory tools as established as Jupyter notebooks, which makes Rust a poor choice for exploring data. However, the effort to make Rust a better tool for this job is starting to show through libraries like Polars.
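To make the integer-or-string case concrete, here is a minimal sketch of what handling such a column could look like in Rust. The `Cell` enum and the helper functions are hypothetical. The mixed types have to be modeled up front, and every piece of code that reads a value must say what happens for each variant.

```rust
/// A value from a hand-typed column: either a whole number or free text.
#[derive(Debug, PartialEq)]
enum Cell {
    Int(i64),
    Text(String),
}

/// Tries to read the raw text as an integer, falling back to text.
fn parse_cell(raw: &str) -> Cell {
    let trimmed = raw.trim();
    match trimmed.parse::<i64>() {
        Ok(number) => Cell::Int(number),
        Err(_) => Cell::Text(trimmed.to_string()),
    }
}

/// Sums the integer cells and skips the text ones.
fn sum_ints(cells: &[Cell]) -> i64 {
    cells
        .iter()
        .map(|cell| match cell {
            Cell::Int(number) => *number,
            Cell::Text(_) => 0,
        })
        .sum()
}

fn main() {
    let raw_column = ["12", " 7 ", "seven", "n/a", "-3"];
    let cells: Vec<Cell> = raw_column.iter().map(|raw| parse_cell(raw)).collect();
    println!("cells: {:?}", cells);
    println!("sum of integer cells: {}", sum_ints(&cells));
}
```

This is more ceremony than a few lines in a notebook, which is the trade-off described above: the explicitness helps in a production pipeline and gets in the way during quick exploration.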
Also, as we will discuss in more depth shortly, Rust has a (significantly) steeper learning curve than SQL and Python, so it’s not the language to use for a quick mock-up or quick fix. Some engineers would argue that the advantage of higher-level languages, like Python, is that the focus can be more on solving the problem than on writing the code, i.e. a higher-level language is typically less verbose and can express the idea in fewer lines of code. This might be beneficial if you’re trying to prototype an idea quickly, and you don’t want to worry about Rust specifics like borrow checking, memory safety, etc. In Python, you might still have those problems, but the interpreter won’t hassle you about them.
The problem of Rust being a more “difficult” language becomes exacerbated if you and/or your team don’t already know Rust. All in all, Rust might be more appropriate for larger projects and implementations, rather than a quick proof-of-concept or ad hoc analysis. However, Aleksei Pianin, Senior Software Engineer, points out that
“Rust and Python is not an either-or-choice. In fact, it might make a lot of sense to develop ideas and prototypes in Python as part of a research phase, and then let the team transform that code into a production grade Rust implementation.”
Given all this, it might be useful to look at Python’s road to widespread adoption in relation to the learning curve, since it can give us an idea of what’s in store for Rust. A major reason Python has had such great success in the machine learning and data science world is that NumPy provided a much better interface for linear algebra and matrix calculations than what people were used to, namely BLAS (a Fortran library from the late 70s) and LAPACK. NumPy effectively wrapped those numeric libraries in a way that made it easy to write algorithms at a higher level of abstraction without sacrificing performance. This was possible because NumPy calls heavily optimized Fortran and C procedures. This in turn provided a much-needed seed for the Python machine learning ecosystem to be built upon.
Our guess is that many data engineers today might not have the time and expertise to write performant Rust code, unless they spend a large part of their day writing and learning Rust. This follows naturally from the difficulty of the language. They might instead gravitate towards something like a Python library, because it works with their existing ecosystem. A plausible scenario is that they end up using some nicely wrapped Rust libraries under the hood, similar to the NumPy example above, perhaps without even knowing it. This is yet another argument for why Python and Rust might become complementary rather than competitors, as Aleksei pointed out above.
Lastly, as a direct consequence of the steep learning curve, data engineering talent that can develop using Rust might be significantly more difficult to source and hire, especially if you’re looking for those who have significant data experience. This becomes a very important factor to consider when choosing a programming language for data engineering in your data team.
To summarize, Rust might not be the best choice for data engineering if Rust as a language isn’t compatible with the rest of your data ecosystem, if you want to prototype something quickly, if your team doesn’t already know Rust, and if you expect to have a hard time finding and sourcing Rust talent.
Obstacles to more widespread adoption of Rust
In addition to these factors, there is one more significant question for the data community to figure out before Rust is fully poised to gain even more traction in the field: the relative immaturity of the ecosystem. Compared to other languages like Python and Java, Rust libraries are in their infancy in terms of maturity and variety, effectively limiting where and how Rust can be successfully used. This can largely be explained by the fact that Rust is a relatively new language compared to more established ones like Python, Java and C++. To illustrate, Validio—which is mainly developed in Rust—is using Golang for its ingress services where the SDKs for Snowflake, GCP, and AWS are currently much more mature than for Rust.
Similarly, if you need to build data science projects (or other data infrastructure), Python does have better support from a libraries point of view. Some examples of widely used Python libraries include pandas for dataframes, NumPy for numerical computing, scikit-learn for machine learning, and Matplotlib for visualization. With that said, Rust is getting more and more support in terms of libraries as the community grows. For example, some data-related libraries include Polars (mentioned earlier) for dataframes, ndarray as a NumPy equivalent, and linfa as a scikit-learn equivalent. The library obstacle might thus be temporary, and within a few years the gap to other languages might be closed entirely. To follow the development of Rust libraries for machine learning, we recommend regularly checking Are we learning yet’s catalog of Rust libraries related to data.
In summary, it’s clear many developers are big Rust fans. However, before starting to use Rust for data engineering, data teams must understand the requirements and limitations of their particular context. With that said, we’re excited to see how far the language can reach within data pipelines in 2023 and beyond, and we’re particularly excited to follow the evolution of Rust’s libraries. We believe these will be a fundamental factor in unlocking even more widespread adoption of Rust!