At ThoughtSpot headquarters, there are two monitors displaying test results for our master branch. Every day as you walk into work, these monitors are flickering with some redness. Occasionally you see all-green builds, but typically such joy is short-lived, and each new breakage takes a non-trivial effort across the team to resolve.
So why is our master so easily broken? Because every team checks in their code to master directly.
To be fair, this is an efficient way of developing. Development is always on the latest code from all teams; your tests are always running against the latest integration - and all that is great! However, there’s a trade-off between freshness and correctness, where correctness comes down to test coverage: you cannot throw all your rigorous test suites against each commit to master. As a result, once a regression is accidentally introduced in master, it could take a while before it’s detected, and it could take significant effort to find and revert or fix the offending change among hundreds of commits. When one team’s changes break another team’s functionality, it takes even more effort to figure that out. Gradually developers give up chasing after test failures, and things can quickly degenerate into perpetually broken builds.
Once in this steady state, why bother with monitoring build quality? Instead, as a developer it’s tempting to just focus on features and come back to fix tests when we need to release. So people tend to ignore failed tests and we end up in a vicious cycle of poor builds perpetuating more poor builds.
Why is green master important? A constantly green master branch can help deliver software on time and unblock testing, thus increasing the effective amount of testing exposure and improving overall product quality. Is there a practical path from the world of broken builds to green master, green like kale?
One way to reason about it is to revisit where we stand in the freshness-correctness trade-off. A world where every change goes to master is on the extreme end of freshness. Perhaps we need to trade off some freshness for correctness. One idea that comes to mind is to split development on the master branch into separate team branches. After the split, each team develops on its own branch, and direct check-ins to master are no longer allowed. Of course we still want to run some quick sanity tests whenever local code changes are checked into a team branch, so we can catch bugs as early as possible. But before a team branch is allowed to merge to master, a much larger test suite has to pass. This suite covers all aspects of our applications that could be affected by code changes, from back end to front end. It could take a few hours to finish, and merge frequency is reduced - but that’s what we are signing up for with the freshness-correctness trade-off.
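The two-tier gating described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual CI configuration: the function names `run_suite` and `may_land`, the branch name `team-backend`, and the idea of modelling each test as a callable that returns True on success are all hypothetical.

```python
from typing import Callable, Iterable

# Assumption: a test is modelled as a callable returning True when it passes.
Test = Callable[[], bool]


def run_suite(tests: Iterable[Test]) -> bool:
    """Run every test in the suite; the suite is green only if all pass."""
    results = [test() for test in tests]  # run all tests, no early exit
    return all(results)


def may_land(target: str, sanity: Iterable[Test], qualify: Iterable[Test]) -> bool:
    """Decide whether a change may land on the target branch.

    Every change must pass the quick sanity tests. Only a merge to
    master additionally requires the larger (slower) qualify suite.
    """
    if not run_suite(sanity):
        return False
    if target == "master":
        return run_suite(qualify)
    return True


if __name__ == "__main__":
    passing = lambda: True
    failing = lambda: False
    # A check-in to a team branch only needs the sanity tests.
    print(may_land("team-backend", [passing], [failing]))
    # A merge to master also needs the qualify suite.
    print(may_land("master", [passing], [failing]))
```

The point of the split is cost: the cheap gate runs on every check-in, while the expensive gate runs only at the less frequent merge to master.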
We took our first step towards a perpetually green master at ThoughtSpot this year using this scheme, and things have been looking up. However, this theoretical idea routinely runs into practical issues. If a team branch fails to integrate with master on a given day for any reason, it misses that window of opportunity and has to wait for another day, until all the blockers (bugs) are removed. We’ve found that in some pathological conditions this may happen quite frequently, and the affected team branch keeps falling further behind master with every passing day. That’s why we adopted a two-way strategy: besides merging team branches up to master, we constantly down-merge from master to the team branches. We run the same qualify jobs on the master branch constantly and, whenever the tests succeed, merge master into every team branch. This way every team branch stays up to date with the latest green state of master.
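The down-merge step can be sketched as follows. This is a toy model under loud assumptions, not our real tooling: branches are represented as sets of commit ids, set union stands in for an actual merge (so there are no conflicts), and `down_merge` and the `qualify` callable are hypothetical names.

```python
from typing import Callable, Dict, Set


def down_merge(
    master: Set[str],
    team_branches: Dict[str, Set[str]],
    qualify: Callable[[Set[str]], bool],
) -> bool:
    """Merge master into every team branch, but only if master is green.

    Returns True if the down-merge happened, False if master failed
    the qualify run and nothing was propagated.
    """
    if not qualify(master):
        return False  # never push a red master down to the teams
    for commits in team_branches.values():
        commits |= master  # in-place union: a stand-in for a real merge
    return True


if __name__ == "__main__":
    master = {"m1", "m2"}
    teams = {"backend": {"m1", "b1"}, "frontend": {"m1"}}
    merged = down_merge(master, teams, qualify=lambda commits: True)
    print(merged, sorted(teams["backend"]), sorted(teams["frontend"]))
```

Gating the down-merge on a green master is the design choice that matters here: team branches stay fresh, but they only ever pick up a state of master that has passed the qualify jobs.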
There is another advantage of running tests on the master branch constantly. When we run qualify tests for a team branch, we first merge it locally with master and test the result. Because the same tests are also running on master by itself, a failure in a team branch’s qualify run can be attributed with more confidence: if master is green at that point, the issue is unlikely to be coming from the master branch.
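That reasoning amounts to a small decision table, sketched below. The `Verdict` names and the `diagnose` function are hypothetical, and the output is only a hint about where to look first, since flaky tests or a master that moved between the two runs can still mislead.

```python
from enum import Enum


class Verdict(Enum):
    OK_TO_MERGE = "ok to merge"
    SUSPECT_TEAM_BRANCH = "suspect team branch"
    SUSPECT_MASTER = "suspect master"


def diagnose(master_green: bool, merged_green: bool) -> Verdict:
    """Suggest where to look first after a team branch's qualify run.

    master_green: result of the qualify tests on master alone.
    merged_green: result of the same tests on the local merge of the
                  team branch with master.
    """
    if merged_green:
        return Verdict.OK_TO_MERGE
    if master_green:
        # Master passes on its own, so the team's changes (or their
        # interaction with master) are the more likely cause.
        return Verdict.SUSPECT_TEAM_BRANCH
    # Master is already red, so the failure may have been inherited.
    return Verdict.SUSPECT_MASTER


if __name__ == "__main__":
    print(diagnose(master_green=True, merged_green=False).value)
```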
Since adopting this model, we have been happy to see a green master more and more often. The red pixels on our build monitors have been aching to be called into action, but to no avail! Of course, there are many places where things can still be improved. We should add more tests to the qualify jobs, including performance tests. We should also allocate resources so that the qualify test suites can run more often and catch bugs earlier.
The most exciting part is we can start working on continuous delivery now. With a smooth CI model, we should be able to easily achieve CD and then Nirvana, in that order!