
# Purported Advantages of Monolithic Repositories (Monorepos)

> [!NOTE]
> This article describes just a small part of our development process. Our goal with all of our practices is to enable developers to be just as productive on old projects as they are on new ones.
>
> I'd recommend starting with the article that describes that goal:
>
> The Goal: Continuity

In 2015, 6to5 was renamed to Babel. Not long after that, the Babel project was transitioned to a monorepo, meaning that all of its code, across (currently) over 150 packages, is contained in a single Git repository. At the time, putting all of your code into a single repository was considered shocking by many developers, but Google, other large companies (e.g., Twitter), and projects (e.g., Linux) did it, and it was well advocated for, so it fairly quickly became accepted. Many even considered it a best practice.

This is just a personal anecdote, but I remember that one of the primary benefits being cited at the time for open source projects was that it eased issue management on GitHub. Maintainers grew tired of having to monitor 100 issue lists and manually recreate issues that people filed in the wrong repository. At the time, GitHub did not have tooling to move issues and, if I remember correctly, it didn't have a good answer for aggregating issues across repositories in a single organization other than search. So, in order to combat GitHub missing the mark on tooling for open source maintainers, open source maintainers combined all of their nicely partitioned code into a single repository.

Unfortunately, aside from a few exceptions, I believe that monorepos are another fad in a community that is susceptible to fads, and that they often provide more disadvantages than advantages. There is enough written in favor of monorepos, so I will focus this article on providing a different perspective on some of the purported advantages.

"Easier to setup a development environment" - Babel

The Eventide project has a repository called `contributor-assets` which contains a script, `get-projects.sh`. Cloning that repository and running that script (with an environment variable set to specify a parent directory) will clone all of the repositories for that project. We have a similar repository and script that clones over 400 repositories. Our version allows the user to optionally use GNU parallel to parallelize this. A fresh clone of all 400+ repositories took 12 seconds on my machine when done in parallel.
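
For illustration, here is a minimal sketch of what a clone-everything script can look like. This is not the actual `get-projects.sh`; the GitHub organization and the `repos.txt` manifest are stand-ins, and `PROJECTS_DIR` plays the role of the parent-directory variable mentioned above.

```sh
#!/usr/bin/env bash
# Hypothetical clone-everything sketch. The organization name and the
# repos.txt manifest are made up for this example.
set -euo pipefail

: "${PROJECTS_DIR:?Set PROJECTS_DIR to the directory that will hold the clones}"

clone_one() {
  git clone --quiet "git@github.com:example-org/$1.git" "$PROJECTS_DIR/$1"
}
export -f clone_one
export PROJECTS_DIR

if command -v parallel > /dev/null; then
  # GNU parallel runs one clone per job slot, which is what makes a
  # fresh clone of hundreds of repositories take seconds, not minutes
  parallel clone_one :::: repos.txt
else
  while read -r repo; do clone_one "$repo"; done < repos.txt
fi
```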

Our `contributor-assets` project also contains a script and additional setup instructions for getting a developer set up to run the code in the project. It's been my experience that newcomers have no more trouble setting up our project than they would with any monorepo I have ever worked with. If anything, it's been easier, because we maintain our scripts and setup instructions well.

The assertion that it is easier to set up a development environment is demonstrably false. There is nothing magical about a monorepo. Once it is on your computer, it is a collection of directories. How those directories get onto your computer is the primary differentiator that I've observed when it comes to developer setup. The only other relevant difference is that a monorepo can have files in the "root" of a large project. Our `contributor-assets` project is the closest thing we have to a root project. This has not been an impediment for new team members or existing ones.

"Single lint, build, test and release process" - Babel

This is somewhat vague, so I will address what I feel I have a decent grasp on. If I am misrepresenting or missing something, please let me know.

Sharing an eslint configuration (or that of any other linter I am aware of) can be done by putting that configuration into a package. That package can live in its own repository, be published to a package server, and be used in every repository that a team has. We do this for our legacy TypeScript code that still uses eslint. Each project, before a commit is pushed to our central repository, must have its packages updated and its checks run. This ensures that every repository adheres to the same linting configuration and that each project is checked. It also ensures that the check is done at the smallest reasonable batch size: that of a single repository. There is no need to check an entire project suite when only a subset changes.
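
As a sketch of what that looks like in practice, each repository consumes the shared configuration as an ordinary dependency. The package name here is made up, and this uses the classic eslint config format our legacy code would use.

```sh
# Hypothetical example: @example-org/eslint-config is a shared config
# published as its own package from its own repository.
npm install --save-dev @example-org/eslint-config

# Each repository's local configuration simply extends the shared one.
cat > .eslintrc.json << 'JSON'
{
  "extends": ["@example-org/eslint-config"]
}
JSON

# Before a commit is pushed: update packages, then run the checks.
npm update @example-org/eslint-config
npx eslint .
```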

If I did want to check every single repository all at once, I could always write a script to do this. Combining is trivial; but ask any fledgling monorepo user how they ensure that their CI runs only the necessary tests when a single package changes, and you will hear how challenging it is to separate.
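
Such a script can be as small as a loop. In this sketch, the `test.sh` entry point and the `PROJECTS_DIR` layout are assumed conventions, not our actual tooling:

```sh
# Hypothetical check-everything loop. Assumes PROJECTS_DIR holds one
# clone per repository and each repository exposes a test.sh script.
for repo in "$PROJECTS_DIR"/*/; do
  (cd "$repo" && ./test.sh) || echo "FAILED: $repo"
done
```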

The same can be said for testing and building. Again, test and build configuration can be shared either across packages or put in boilerplate scripts. We have a single repository called `project-scripts` that contains boilerplate scripts as well as scripts for updating older versions of those scripts across our repositories. We didn't always have this; we did it by hand until we had enough repositories that it was worth investing in additional tooling.
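
A sketch of what such an updater can look like follows. The script and path names are assumptions for illustration, not our actual tooling:

```sh
# Hypothetical project-scripts-style updater: copy the current
# boilerplate into every clone and record a commit for review.
for repo in "$PROJECTS_DIR"/*/; do
  cp boilerplate/test.sh "$repo/test.sh"
  (cd "$repo" && git add test.sh \
    && git commit --quiet -m "Update test script") || true  # no-op if unchanged
done
```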

Because each repository has its own copy of the scripts, when a single package must alter its build or any other script in some way, it can, without having to adhere to an overly aggressive generalization. How a given package is tested and built is specialized to that package, but we can still adhere to norms and standards that ensure they are all identical unless there is necessary variation.

"Easy to coordinate changes across modules" - Babel

There is a lot to unpack in this. When I first learned about Google's monorepo, I remember this being touted as one of the benefits. It was stated differently, though — it was a forcing function to ensure that a breaking change to a module was never introduced without also including the update to that module's efferents to account for the breaking change. This is laudable, and we can achieve something similar without a monorepo.

When we start a new web project from our GitHub template project and install gems, we get over 50 private gems. This means that each of our web projects likely has at least 50 private gems. If we make a breaking change to one of those gems, we do not consider that work complete until every one of the efferent projects is updated as well. As a matter of fact, we typically do not allow publishing that breaking gem version until all changes are ready to go in all projects. How this is achieved can vary based on the actual change. Sometimes, we use branches in each project with the necessary changes, and only merge them into master once all are ready and the gem with the breaking change is published. Other times, we may publish a backwards compatible version of the change, update all efferent projects, and then make and publish the breaking change. This ensures that all projects are left in a working state at all times. Regardless of the technique, we make use of communication to coordinate. Our team is relatively small, so basic communication techniques ("@here" in a Slack channel) typically suffice.

The techniques for publishing backwards compatible versions of changes are of particular interest. It is through techniques very similar to these that we can also maintain a bevy of autonomous projects that are all deployed separately but need to work together. These are also techniques similar to those used to achieve "zero-downtime migrations", version APIs, version protocol buffers, etc. I assert that they are essential skills to learn, and that without them, one should be wary of any architecture or project that requires those skills.
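
As a sketch, the "backwards compatible first" technique looks something like this at the command line. The gem and project names are made up, and the exact commands will vary with a team's tooling:

```sh
# 1. Release a version of the gem that supports both the old and the
#    new interface (e.g., the old method delegates to the new one and
#    emits a deprecation warning).
cd some-gem
gem build some-gem.gemspec
gem push some-gem-1.3.0.gem

# 2. Update each efferent project to the new interface, one at a time,
#    deploying and validating each before moving on.
cd ../some-web-project
bundle update some-gem
bundle exec rake test

# 3. Only after no project uses the old interface, release the
#    breaking change as a new major version.
cd ../some-gem
gem build some-gem.gemspec
gem push some-gem-2.0.0.gem
```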

The next aspect to unpack is whether or not it is an improvement to offer the ability for a single pull request or commit to span multiple packages. On its face, this is an increase in batch size. As a matter of principle, this is undesirable. We know from our Lean studies that reducing batch size can yield increases in productivity and reduce the chance of mistakes. If your reaction to this is along the lines of "Not always!", then I would suggest that more study may be called for. We know that limiting commits, package releases, or deployments to a single changed line would be ludicrous. There is a minimum batch size for a given thing in a system with its constraints at a particular point in time.

In this instance though, what is the problem with an increased batch size? Compare the approach of making a breaking change and then updating all efferents to account for that breaking change in a single pull request versus making a backwards compatible change and then updating a single efferent project. Once that single project is updated, it can be deployed to production and final inspection can be done on it. This gives the developer further confidence in their change, and they can now apply it to every other efferent project, one at a time, validating each one. This is closer to one piece flow and has the benefits that come along with that. It also means that, if your team does pull requests, each individual pull request is smaller. First, the backwards compatible change can be reviewed and checked for backwards compatibility; then the first update to the first project can be reviewed. As someone who has done countless live code reviews and reviewed even more pull requests, I can say that, at least in my experience, I make far fewer mistakes and am more attentive when reviewing smaller pull requests or code changes. Code reviews for a repeated collection of similar changes are far more likely to only get a cursory glance and a rubber stamp.

There's another, perhaps more insidious, problem with changes spanning multiple packages. They may obscure a structural design mistake. It may be that you should actually be able to make the change in only one place and that you need to do some work to achieve that state. As an example, we noticed that any time we updated a core style, such as the way a button looks, we had over a dozen projects to deploy. This happened frequently enough that it led us to explore techniques for reducing this friction. Where we landed is a single "Layout" project that contains the application's stylesheets and primary UI "chrome". That project is deployed and every web project uses Nginx SSI to get the latest version of the stylesheet at all times. Now, when we need to change the style of a button, we deploy a single web application, and over a dozen web applications are effectively updated. Similar concerns also led us to the aforementioned project-scripts repository and multiple other innovations.
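
For illustration, the SSI arrangement can be sketched like this. The upstream hosts and paths are hypothetical, and the config is written from a shell heredoc only to keep the sketch self-contained:

```sh
# Hypothetical Nginx configuration for one web application. The Layout
# project is deployed separately and serves the shared chrome.
cat > /etc/nginx/conf.d/web-app.conf << 'CONF'
server {
  listen 80;
  ssi on;  # enable server-side includes on responses

  # The separately deployed Layout project
  location /layout/ {
    proxy_pass http://layout-app/;
  }

  location / {
    proxy_pass http://web-app/;
  }
}
CONF

# A page in any web application can then pull in the shared fragment:
#   <!--# include virtual="/layout/head.html" -->
# Deploying the Layout project updates that fragment for every app.
```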

"Single place to report issues" - Babel

I would say that this is a benefit to the project's end users. As someone who has encountered an issue, I would want to report that issue without having to jump through hoops by being required to categorize it to a level of specificity that I may not be capable of. This is the same problem as customer support forms that make you select a department, a category, and then a subcategory, etc. They are user hostile. It's why I always prefer to use customer support email addresses over those forms. So yes, this was a problem with GitHub.

But that's just it, it was a problem with GitHub. GitHub should have fixed this. Eventually they added the ability to move issues from one repository to another, but that doesn't solve the initial problem of "where do I put my issue in the first place?" I'm not sure of exactly the right way to solve this problem because I haven't done the analysis and design. I can say with confidence that forcing your users to forego the structural design benefits of multiple repositories in order to provide a reasonable customer experience was not the right answer, but I understand that it is seen as a benefit.

The fact that this is actually a benefit (to the end user) is one of the reasons that I see it as one of the primary motivators for adopting monorepos in open source. Eventide solves this problem by providing usage and other support via an active Slack, but it has a much smaller user base than something like Babel. If I were running something like Babel, I would look into creating an "issues" repository, pinning it, and linking to it from the site and from every other repository. In short, I would set a target condition of having both multiple repositories and a single place for my users to report issues, and work the problem.

"Tests across modules are run together which finds bugs that touch multiple modules more easily" - Babel

This is a challenging one to discuss. There are clear benefits to increasing the breadth of tests to encompass everything under your control, and specifically to exercising all efferents when an afferent changes. There are also downsides, however. The downsides are along the same lines as those of the popular myth that you should "Write tests. Not too many. Mostly integration." I'm not going to attempt to break that all down in this article, but I have written guidelines for packages that include some discussion of the necessity of being able to validate a package on its own, without having to include its efferents.

Even if your package did have sufficient testing to stand on its own in your monorepo, you can still benefit from testing efferents with that version of the package. We recognize this as well, which is why we have the capability of referencing a local version of a package (from another repository) when working in another repository. You can see this in Eventide in the `symlink-lib.sh` and `library-symlinks.sh` scripts. Essentially, we include a common directory in the Ruby load path and symlink local libraries there when we need to test something between two projects. It's important to note that we use this either for experimentation, where we are not writing code we intend to commit but just want to see what is possible, or for integration with new versions before we publish them. That is, we don't use it to test the afferent, because we have already tested the afferent by this point. If we do find an issue with the afferent when using it in the efferent, then we consider that a problem with the afferent's testing, because we should have been able to catch it there.
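
Here is a minimal sketch of the idea, not the actual `symlink-lib.sh`; the directory and library names are made up:

```sh
# Hypothetical symlinking sketch. LIB_DIR is a directory placed on the
# Ruby load path for local development, e.g. via RUBYLIB.
LIB_DIR="$HOME/.local-lib"
mkdir -p "$LIB_DIR"

# Point the shared load path at the local working copy of the library
ln -sfn "$PWD/../some-library/lib/some_library.rb" "$LIB_DIR/some_library.rb"
ln -sfn "$PWD/../some-library/lib/some_library" "$LIB_DIR/some_library"

# The efferent project now resolves the afferent to the local copy:
RUBYLIB="$LIB_DIR" ruby -e 'require "some_library"'
```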

As I said, this is a challenging one to discuss. The benefits of the test suites being separate are self-evident, but only if you have a good grasp of the principles at play. If you don't, then this section will likely have read as a recommendation for monorepos. We want our packages to stand on their own. We want to encourage developers to build and test things in isolation, where they are shielded from incidental complexity. We want to maintain small batch sizes so that we can make steady progress and are not loaded down by all of the problems associated with large batches. It is always easier to combine than it is to separate. If your testing strategy for a package requires testing through the efferents of that package, you have not truly separated that package. That package does not stand on its own, and therefore you will not get any of the benefits of it standing on its own. Imagine if Rails required that all Rails apps be in the Rails repository so that it could run all of the projects' tests, instead of Rails having its own test suite.

