Dec 5, 2022

pip install data_team

Why you should develop an internal Python pip package for your data team.

Let me ask you a question! Does this sound familiar?

"Ugh, I've already written this function in three other projects. Which ones were they again? I need to find out so that I can copy and paste my code a fourth time here. Agh, I can't remember how this code worked; apparently, I also needed this and that and that..."

Well, if you don't identify with this, I might have a question or two for you, but if you do, then you're in luck!

With only one command, we can make all this trouble go away!

Imagine if three magical words could give you access to all of your team's prewritten code and knowledge as soon as you start a project.

For example, how to:

run queries on your data warehouse
rerun your Airflow DAGs
retrieve visualizations from the company's BI tool
upsert freshly scraped data to your CRM
send Slack messages as Slackbot to trick other teammates

I told you it was neat! But how? By leveraging something you've been using since you started programming: libraries.

That's right!

So let's see together why you might want to write an internal Python library for your data team. With it, you and your teammates can just run pip install your_internal_lib and stop rewriting the same code over and over again. It also makes your projects simpler, gives you access to already solved problems, helps you share best practices, and a lot more.

Why write an internal library for your team?

As stated before, the main benefit of a library is obviously to stop wasting time doing what has already been done. While this is a pretty good selling point, it is far from the only thing you'll gain from writing your own as a team.

Access to a compilation of solutions

What if I told you to write a function to get the square root of a number without using the math library in Python? How would you go about doing that? You'd probably have to invest a bit of time to come up with an efficient solution.
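To make the question concrete, here is one possible answer: a minimal sketch based on Newton's method. The function name square_root and the tolerance parameter are illustrative choices, not something taken from an existing library.

```python
def square_root(value: float, tolerance: float = 1e-12) -> float:
    """Approximate the square root of a non-negative number with Newton's method."""
    if value < 0:
        raise ValueError("square_root is only defined for non-negative numbers")
    if value == 0:
        return 0.0

    # Start at or above the true root so the estimates decrease towards it.
    estimate = value if value >= 1 else 1.0
    while True:
        # Newton's update for f(x) = x^2 - value.
        next_estimate = 0.5 * (estimate + value / estimate)
        # Stop once two consecutive estimates agree to the relative tolerance.
        if abs(next_estimate - estimate) <= tolerance * next_estimate:
            return next_estimate
        estimate = next_estimate


print(square_root(2.0))
```

It is only a dozen lines, but getting the edge cases right (zero, negative input, numbers below one) is exactly the kind of work nobody wants to redo in every project.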

Well, as engineers we're constantly coming up with solutions to these kinds of problems and, as you might have noticed if you've ever stumbled upon Stack Overflow, we do that a lot.

Unfortunately, once a problem has been solved, the first thing that comes to mind isn't as altruistic as: "my teammates might lose time on this in the future, I should find a way to make my solution accessible to them".

No. The reason you had to come up with a solution in the first place is just so you could be one step closer to achieving your goal.

But as I said before, it is unfortunate. Someone after you might also need to find out how to get the square root of a number and would have liked to have your solution, so they'd have more time to focus on something that brings more value.

This is even more true when working in a team. You and your peers tend to work on common issues and will definitely, at some point, try to solve problems that one of you has already solved.

Sure, libraries help keep code cleaner, but fundamentally what they provide is far greater than that: a compilation of solutions to problems people spent time solving to help themselves or others in the future. Isn't that beautiful?

Another reason why internal libraries are a great addition to a data team is that they set a standard.

Ever wondered if what you're doing is the best way to do it? Ever wondered if the code you're writing will easily be understood by your peers? These are great questions to ask yourself, but answering them takes time and experience. Once answered, you're left with valuable information that should be shared with others.

Libraries, if done right, should help with just that. They shouldn't only be a package containing solutions to recurring problems, but also a codebase that anyone can use as a reference to know how things should be done.

In other words: your library provides examples of what your standards for doing things are and the best practices you've learned.

How should your code be structured? When to use classes? Do you use type hints? Do you use dataclasses? How should my environment variables be handled in my code? How should I test my code?

These questions can definitely be answered in a nice Notion note, but examples available within your project using only one command might work better.
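As an illustration of what such an in-library example could look like, here is a small sketch that answers three of those questions at once: type hints, a dataclass, and environment variable handling. The class name WarehouseSettings and the WAREHOUSE_* variable names are hypothetical; the default port is Redshift's standard 5439.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WarehouseSettings:
    """Connection settings, read once and passed around explicitly."""

    host: str
    port: int
    user: str

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "WarehouseSettings":
        # Accepting a mapping makes the class easy to test without
        # touching the real process environment.
        source = os.environ if env is None else env
        return cls(
            host=source["WAREHOUSE_HOST"],                   # hypothetical variable name
            port=int(source.get("WAREHOUSE_PORT", "5439")),  # hypothetical variable name
            user=source["WAREHOUSE_USER"],                   # hypothetical variable name
        )


settings = WarehouseSettings.from_env(
    {"WAREHOUSE_HOST": "warehouse.internal", "WAREHOUSE_USER": "analyst"}
)
print(settings)
```

A teammate reading this learns the convention (settings live in a frozen dataclass, secrets come from the environment) without opening a single wiki page.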

Collaborate and get feedback

Every data team should strive to achieve a strong collaboration and feedback culture. An internal team library can also help with that.

This might be an idealistic take, but contributing to a project that directly improves your peers' day-to-day is incredibly rewarding.

So, once an internal library is set up and if you find value in it, when a solution to a new problem is found, you start to wonder: should I add this to our internal library?

This creates a shift towards a more altruistic mindset that makes the team as a whole grow.

Contributing to this library is also a great way to get feedback and improve, as teammates with maybe more experience than you will take a look at what you've done and suggest or add improvements that you'll be able to learn from.

Other than peer review, you may also find value in thinking deeply about the things you want to add to that library. You're no longer doing it only for yourself, but also for others who might come across what you built and need it in an important future project of theirs.

This creates a virtuous cycle where you:

Think deeply about what you've done;
Improve it;
Deploy it to the library;
Get feedback;
...

Faster ramp-up time

This library is also a gold mine for newcomers and junior developers. Imagine if as soon as you join a team, by just running one command, you could have access to functions that perform tasks that are exactly what you need for your first project.

Need to get freshly updated data for a sales performance analysis?

Well, just import the Airflow and Redshift classes from the team's library to rerun your Salesforce DAG and run a query on Redshift to get the results. All of this, with only three lines of code.

from team_lib import Airflow, Redshift
 
Airflow().run_salesforce_dag()
df_sales = Redshift().get_query_result("SELECT * FROM salesforce.opportunity")

This doesn't replace internal documentation, but asking to go through your internal library can be a great addition to your onboarding process to quickly give a sense of what tools you're using and see tangible examples of what they might be used for.
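What sits behind a call like get_query_result is not shown in this post. As a rough idea, such a method could be a thin wrapper around a database connection. The sketch below is an assumption, not the actual team_lib code: the class name Warehouse is hypothetical, it returns plain dictionaries rather than a DataFrame, and an in-memory SQLite database stands in for Redshift so the example runs anywhere.

```python
import sqlite3
from typing import Any, Dict, List


class Warehouse:
    """Hypothetical wrapper around any DB-API 2.0 connection."""

    def __init__(self, connection: Any) -> None:
        self._connection = connection

    def get_query_result(self, query: str) -> List[Dict[str, Any]]:
        """Run a query and return each row as a column-name -> value dict."""
        cursor = self._connection.cursor()
        try:
            cursor.execute(query)
            # cursor.description holds one tuple per column; the name comes first.
            columns = [column[0] for column in cursor.description]
            return [dict(zip(columns, row)) for row in cursor.fetchall()]
        finally:
            cursor.close()


# SQLite stands in for the real warehouse in this sketch.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE opportunity (name TEXT, amount REAL)")
connection.execute("INSERT INTO opportunity VALUES ('Acme', 1200.0)")

rows = Warehouse(connection).get_query_result("SELECT * FROM opportunity")
print(rows)
```

Because the connection is passed in, the same class can be pointed at a test database, which also makes the wrapper itself easy to cover with tests.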

Things to be cautious about

Abstraction

The first thing to be cautious of with internal libraries is the level of abstraction.

Often libraries abstract complex stuff to make it more accessible. But, paradoxically, too much abstraction can lead to complexity.

After adding layers on top of layers to make things easier to use, two things can happen: new errors can be created during the abstraction process, and debugging might become harder to do since you have more things to unpack.

Unfortunately, there is no miracle formula to solve this issue. To mitigate these risks, a simple answer would be to just test your code before pushing it to your internal library.
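Testing doesn't have to be heavy. Here is a minimal sketch using the standard unittest module; the helper chunk_records is a hypothetical example of the kind of small utility that ends up in a team library (for instance, to batch records before an upsert).

```python
import unittest
from typing import List, Sequence


def chunk_records(records: Sequence, batch_size: int) -> List[Sequence]:
    """Split records into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


class ChunkRecordsTest(unittest.TestCase):
    def test_last_batch_may_be_smaller(self) -> None:
        self.assertEqual(chunk_records([1, 2, 3, 4, 5], 2), [[1, 2], [3, 4], [5]])

    def test_empty_input_gives_no_batches(self) -> None:
        self.assertEqual(chunk_records([], 3), [])

    def test_non_positive_batch_size_is_rejected(self) -> None:
        with self.assertRaises(ValueError):
            chunk_records([1, 2, 3], 0)
```

Saved in a test file, these can be run with python -m unittest before any change is pushed to the library.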

Time-consuming

Making code for your internal library is definitely more time-consuming than just writing a quick function that does a simple task required for your ad hoc analysis.

Just like the libraries you use on a regular basis, yours should be easy to use by anyone, and by your teammates in particular. This doesn't come by itself. Again, it requires you to think deeply about what you're doing and how you're doing it.

Doing this over and over again for things you might not even reuse in the future might seem like a waste. I can understand that.

But I'm not saying that everything you do must be added to your team's library.

My principle is: if you notice that it is the second or third time you're writing a piece of code, or if you think you might need the very thing you're working on in the future, then it might be a good idea to make a note somewhere that it should be added to your library.

While I agree that adding code to your internal library takes time and effort, it is hard not to see how this investment can benefit you and your team later on.

Lack of understanding

Trying to oversimplify everything can lead to people not knowing how to solve basic things, or taking more time to solve them.

With a high level of abstraction, people might come to lack an understanding of how things work and this can create situations where it takes someone way too much time to figure something out, which in essence is the exact opposite of what you're trying to achieve.

This needs to be taken into account and everyone in your team should feel responsible for your internal library and incrementally make changes that they feel would improve it.

How to set up a team's library?

Now let's get to the nitty-gritty! How can I set up my team's internal library? How can I run that one command that gives me access to all of my team's knowledge?

pip install your_internal_lib

Actually, setting it up is really easy. All you need is a repository with a Python project containing a setup script (setup.py in the example below).

├── example_lib
│   ├── __init__.py
│   └── services
│       ├── __init__.py
│       └── slack.py
├── README.md
└── setup.py

This setup script is what gives pip the metadata it needs to perform the installation of a package (more info as to how pip handles the setup.py file here).

Here's an example of a very basic setup script:

from setuptools import setup, find_packages

setup(
    name="example_lib",
    version="0.1.0",
    description="Example of an internal Python library that can be used within a Data Team.",
    author="Andrew",
    install_requires=[
        "slacker",
    ],
    packages=find_packages()
)

Once the repository with the package has been created, go to your Python project that needs it and run this command:

pip install git+your_repo_

This will install your library with its dependencies and you'll be able to see it next to the others:

ls .venv/lib/python3.10/site-packages/example_lib/
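If you'd rather check from Python than from the shell, the standard importlib.metadata module can report the installed version. This is a small sketch; the helper name installed_version is an illustrative choice.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed.
print(installed_version("example_lib"))
```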

Finally, just call whatever module you need from it from within one of your Python files and you should be set:

from example_lib import Slack

Slack.send_message("Success !", "Our internal python library was set up properly !", "success")

Here's the repository related to this example.
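Since the contents of slack.py are not reproduced in this post, here is a guess at what a minimal version could look like. Everything in it is an assumption: it posts to a Slack incoming webhook with the standard library instead of using the slacker dependency listed in the setup script, the SLACK_WEBHOOK_URL variable and the status prefixes are hypothetical, and the sender argument exists only so the class can be tried without a real Slack workspace.

```python
import json
import os
import urllib.request
from typing import Callable, Dict

# Hypothetical mapping from the status argument to a text prefix.
STATUS_PREFIXES = {"success": "[OK]", "warning": "[WARN]", "error": "[ERROR]"}


def post_to_webhook(payload: Dict[str, str]) -> None:
    """POST the payload as JSON to a Slack incoming webhook."""
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # hypothetical environment variable
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()


class Slack:
    @classmethod
    def send_message(
        cls,
        title: str,
        message: str,
        status: str,
        sender: Callable[[Dict[str, str]], None] = post_to_webhook,
    ) -> Dict[str, str]:
        """Format a message and hand it to the sender; return what was sent."""
        prefix = STATUS_PREFIXES.get(status, "[INFO]")
        payload = {"text": f"{prefix} *{title}*\n{message}"}
        sender(payload)
        return payload


# Dry run: collect the payload in a list instead of calling Slack.
outbox = []
Slack.send_message("Success !", "Our internal python library was set up properly !", "success", sender=outbox.append)
print(outbox[0]["text"])
```

For the import from example_lib import Slack to work with the layout shown earlier, example_lib/__init__.py would also need to re-export the class, for example with from example_lib.services.slack import Slack.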

Conclusion: try it

If you don't have an internal library yet, try it. I'm sure there are things you keep on doing that you could already add to it. As explained before, it is really easy to set up and a great way to:

save time by not writing the same code over and over again;
share with others solutions to problems you worked on and that they might encounter;
showcase best practices and standards that your team should follow to write great applications;
improve by thinking deeply about the code that you make available to others and get feedback from your peers that use or review it;
decrease the ramp-up time of newcomers as they have a code base they can use as a reference for their future projects.

It definitely requires investing time and effort, and you should beware of adding too many abstraction levels, but it might become the toolkit that no future project can start without.

At least, I know that it has become that for me. Whether it is to analyze data, create a middleware or just send Slack messages as Slackbot to trick my teammates, I know that I won't have to look through other projects I've done in the past to find out, once again, how to do exactly that.

No. All I'll need is one command and I'll be set.