Bulletproof Productivity
2022 has become the year of Black Swans, and businesses across the globe are slowing down. The headlines keep warning us that winter is coming, and even tech giants are announcing layoffs.
Suppose you were a CIO facing a hard dilemma: “Who should stay and who should go?” How would you define the criteria? There are many answers to this question, and none of them is 100% right. The only way out is to pick the one with the strongest statistical backing.
Let’s take a recent case of a business magnate entering savage mode after acquiring a social media company. That person decided to say goodbye to half of the acquired company’s workforce, software engineers in particular. The criterion chosen was the number of lines of code written over the past year.
How good is that metric? Any software engineer would say it isn’t good at all, because it ignores the peculiarities of software engineering. For instance, seniors might spend most of their time reviewing the code of their junior colleagues. They contribute more value this way than if they only wrote code themselves without assisting their juniors: the amount of quality code per dollar ends up higher. By “dollar” we mean not only a developer’s salary but also the effort it takes to find senior developers in the first place.
Some might wonder why the business magnate, with their vast experience in business and software engineering, doesn’t understand this simple truth. Here is the answer: there weren’t any other viable criteria at hand. When making a quick, hard decision, you don’t have time to dig deeper. Consequently, they will now need to hire engineers to make up for those who were fired by mistake.
Optimizing your business doesn’t have to be done in such a harsh way. You either accept the flat demand and try to lower your costs, or you become more productive with the resources you already have so you can achieve more and sell more. This means you can make use of analytics far more advanced than mere lines of code. Here is how to do it.
For high-quality decisions, we need reliable metrics obtained from reliable data, all adjusted to the specifics of a company or product. In the world of software engineering, let’s target the two categories of people who usually contribute the most: engineers who directly write code, and engineers (usually more senior ones) who, apart from writing code, help others - set tasks, build the architecture, review the code. Let’s call them indirect contributors; the two roles will often overlap. In addition, we should remember the risk of a lopsided view: if we focus only on the quantity of the contribution, we may lose quality, and vice versa. So, for a balanced view, we consider quality-related metrics as well. All in all, this gives us a ‘skeleton’: a two-by-two grid with direct vs. indirect contribution on one axis and quantity vs. quality on the other, i.e. four quadrants (I-IV).
With all that in mind, let’s brainstorm a good set of metrics. The goal is the same as in the introduction above: to answer “who should stay and who should go?” in a measurable, data-driven way.
So, evaluating engineers’ performance via the “lines of code written” metric mentioned at the beginning of the article belongs to the quadrant that measures the quantity of direct contribution. The expectation is “the higher the better”. However, it misses the quality of the work: a developer may generate thousands of LOC (lines of code) that are buggy and unmaintainable.
How can these metrics be balanced? Usually, a history of commits and the ‘life story’ of each line of code are available, showing when each LOC was created and by whom, and when it was modified or deleted (and by whom), often annotated with comments. Using that, we can analyze “code churn”, which helps to evaluate the quality of the written code. Code churn is the percentage of code that is rewritten (or deleted) shortly after it was created - for example, within a few weeks. Some churn is still healthy - it may simply reflect intensive creation of new features - but the general rule of thumb is “the higher the churn, the worse”.
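To make this concrete, here is a minimal churn sketch in Python. It assumes the per-line “life story” has already been extracted from the repository history into simple records; the record layout and the three-week window are illustrative assumptions, not a prescription:

from __future__ import annotations
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LineHistory:
    author: str
    created_at: datetime
    rewritten_at: datetime | None  # when the line was modified or deleted; None if still untouched

def churn_by_author(lines: list[LineHistory], window_days: int = 21) -> dict[str, float]:
    """Percentage of each author's lines rewritten or deleted within window_days of creation."""
    created = defaultdict(int)
    churned = defaultdict(int)
    window = timedelta(days=window_days)
    for line in lines:
        created[line.author] += 1
        if line.rewritten_at is not None and line.rewritten_at - line.created_at <= window:
            churned[line.author] += 1
    return {author: 100.0 * churned[author] / created[author] for author in created}

The result is a churn percentage per developer, which can then be compared within the same tech stack.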
Another idea for weighting quantity with quality in this area (just a reminder that we’re still talking about direct contribution 😊) is to measure the cyclomatic complexity and/or maintainability of the code using automated code analyzers. Some of them are well known and even free, like SonarQube. The reason is simple: writing the code typically accounts for no more than 10% of the cost of subsequently maintaining it. So, if we reduce the cost of maintenance by investing in the quality of the initial code, this should give a good ROI.
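As an illustration of such weighting, here is a sketch that uses the open-source radon analyzer for Python sources (SonarQube, mentioned above, exposes similar measures through its own interface); the exact weighting formula is purely illustrative:

from radon.complexity import cc_visit
from radon.metrics import mi_visit

def quality_adjusted_loc(path: str) -> float:
    """Non-blank LOC, discounted by average cyclomatic complexity and maintainability."""
    with open(path, encoding="utf-8") as f:
        source = f.read()
    loc = sum(1 for line in source.splitlines() if line.strip())   # crude non-blank LOC count
    blocks = cc_visit(source)                                      # per-function/class complexity
    avg_cc = sum(b.complexity for b in blocks) / len(blocks) if blocks else 1.0
    mi = mi_visit(source, multi=True)                              # maintainability index, 0..100
    # Illustrative weighting: maintainable, low-complexity code "counts" for more.
    return loc * (mi / 100.0) / max(avg_cc, 1.0)

The point is not this particular formula but the principle: raw LOC gets discounted by how expensive the code will be to maintain.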
Since different technologies require a different amount of code to implement similar functionality, it is better to compare developers of the same tech stack only.
Additionally, you would need to pick out the actual code, ignoring comments. You should also be able to tell business logic from “boilerplate” code, configuration from algorithms, smart usage of libraries from reinventing the wheel, and so on. Good engineers know that the less code you have to write to achieve the result, the better. That’s why finding the right library is usually a better investment of time than writing your own implementation of a common task. But evaluating a library requires a lot of research into its security and flexibility, so when a new library is added, the system has to assume some work was done in the background to write just a couple of lines that use it.
And sometimes the files you have to change are very large, complex, and full of spaghetti code, so finding the right place to change a single “if” may actually take a day.
If you want to compare developers by code, you will need several dozen metrics, if not more. The system becomes so complex that you need fuzzy logic just to manage it. So, you can build a machine learning model with these metrics as features, train it on many teams, and then ask the trained model which cluster it would put a particular developer in - lowest, low, medium, high, or highest performer. Even this result should be just an input for a manager’s consideration, not nearly enough signal to make a final decision.
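A minimal clustering sketch along these lines, using scikit-learn (the feature set and the synthetic data are placeholders for real per-developer metrics):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per developer; columns could be LOC added, churn %, average complexity,
# review comments received, tasks closed, etc. Synthetic stand-in data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))

X_scaled = StandardScaler().fit_transform(X)              # put all features on a comparable scale
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)  # cluster ids 0..4; map them to lowest..highest by inspecting the cluster centers

As stated above, the cluster a developer lands in is an input for a manager’s judgment, not a verdict.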
We’re on the way to summarizing a full set of relevant metrics to cover all four quadrants...
Let’s now think about quadrants III and IV, which evaluate the performance of indirect contribution.
Quite simple metrics appear here if we can involve not only the code itself but also the nearby ecosystems and processes that usually help IT teams bring code to production. One of them is code review - whether by peers or by automated code analyzers. In either case, it gives us the number and severity of issues (often called ‘blames’) raised about the code. Another example is the software build pipeline (or full CI/CD, which stands for “continuous integration and continuous deployment”): it builds the software from the source code, runs various types of testing, and may even deploy to the production environment in an automated way so that end users can try new features.
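Here is a small sketch of turning exported review data into such numbers; the record fields and severity weights are assumptions for illustration, since every review tool exports this differently:

from collections import Counter, defaultdict

review_issues = [
    {"reviewer": "alice", "author": "bob",   "severity": "major"},
    {"reviewer": "alice", "author": "carol", "severity": "minor"},
    {"reviewer": "dave",  "author": "bob",   "severity": "critical"},
]

SEVERITY_WEIGHT = {"minor": 1, "major": 3, "critical": 5}   # illustrative weights

reviewer_effort = defaultdict(int)     # indirect contribution: how much review work each reviewer did
author_blames = defaultdict(Counter)   # quality signal: issues raised against each author's code

for issue in review_issues:
    reviewer_effort[issue["reviewer"]] += SEVERITY_WEIGHT[issue["severity"]]
    author_blames[issue["author"]][issue["severity"]] += 1

print(dict(reviewer_effort))
print({author: dict(counts) for author, counts in author_blames.items()})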
Constantly replacing the bottom 20-30% makes sure your teams keep getting better.
Collecting a metric from the source engineering data lets you rank by that metric - for example, measuring individual contribution via the number of LOCs written.
In turn, collecting multiple metrics lets you introduce an aggregated (i.e. more balanced) score, with both the quantity and quality aspects covered. For example, a productivity ‘score’ might look something like the sketch below.
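As an illustration only (the components and their weights are assumptions, not a prescribed formula):

def productivity_score(loc_added: int, tasks_closed: int,
                       churn_pct: float, blames_per_kloc: float) -> float:
    """Quantity is rewarded; churn and review blames act as a quality penalty."""
    quantity = loc_added / 1000.0 + tasks_closed
    quality_penalty = churn_pct / 100.0 + blames_per_kloc
    return quantity / (1.0 + quality_penalty)

print(round(productivity_score(12000, 40, 35.0, 4.0), 2))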
Going deeper into the math, you may introduce ‘weight’ coefficients for each component of the formula, thus putting more or less emphasis on particular factors of productivity in the context of a specific engineering culture, product, or company. We’ll skip that part for the sake of simplicity. Let’s just assume we already have the ‘score’ and think about how to use it.
The first natural intention may be to rank the people and/or teams in the company according to the score. This gives us buckets of performance:
Top performers – motivate
Medium performers – retain (though don’t invest aggressively)
Low performers – replace
Remember that the ranking makes sense only between developers of the same expected skill level! If your strategy was to give junior developers some growth opportunities, it’s unfair to rank them in the same category as senior developers.
How big should each of those buckets be? That’s a good question.
For example, the legendary Jack Welch used the 20-70-10 rule at General Electric (source: Jack Welch's 20-70-10 | Manager Tools, manager-tools.com).
Reserving 10% for the bottom bucket keeps the company, its staff, and its productivity improving every day and, of course, cultivates a corporate culture of continuous improvement and self-development.
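Applying such a split to the scores is straightforward; here is a sketch with the 20-70-10 cut-offs as parameters:

def bucketize(scores: dict[str, float], top: float = 0.20, bottom: float = 0.10) -> dict[str, str]:
    """Split people into top / middle / bottom buckets by their score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    top_cut, bottom_cut = round(n * top), round(n * bottom)
    buckets = {}
    for i, person in enumerate(ranked):
        if i < top_cut:
            buckets[person] = "top"        # motivate
        elif i >= n - bottom_cut:
            buckets[person] = "bottom"     # replace (or coach)
        else:
            buckets[person] = "middle"     # retain
    return buckets

Remember the caveat above: run it separately for each skill level and tech stack.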
Another case is a “big bang” dismissal like the one that happened at the social media company mentioned at the beginning. Pure performance scores help here, but they are not enough without an understanding of the financials - payroll and the company’s P&L - so the financial aspect is another main input to the equation. The split may end up being 10%, 30%, 50%, and so on. The more balanced the metric-based performance scoring, the more accurate the result, i.e. the fewer ‘false firings’.
The main idea is the core principle: measuring performance and ranking the results lets you identify top, medium, and low performers and make decisions. During this year of disruptions and recessions across the globe, optimizing company staff with a data-driven, performance-oriented strategy sounds like a better plan than what most of the industry is doing.
EPAM has built an internal platform named PERF for collecting a wide range of metrics about the productivity and quality of software engineering teams. Although it can track individual metrics, its main focus is team productivity. Hundreds of teams have become 50+% more productive by measuring themselves and applying the knowledge of metrics interpretation.
When measuring productivity, we consider the following:
Example #1 – LOCs committed into the code repository, with dynamics by day
Example #2 – the number of closed tasks in the task tracker
Example #3 – the percentage of tasks accepted with no “returns”
While we use this way of collecting metrics widely across projects at EPAM, we also offer the solution to our clients.
When choosing certain metrics as our KPIs, we should also be aware of their mutually destructive pairs. You can easily find the complementary metric just by trying to corrupt the KPI you have chosen. For example, you can boost team velocity at the cost of quality (more bugs and tech debt).
Regarding the example at the beginning of this article, the “lines of code added” metric can be “hacked” with a large number of comments, by duplicating code, or by using weird formatting where every keyword sits on its own line. Even 20 years ago there were plenty of funny stories of programmers corrupting this KPI just for fun, drawing masterpiece pictures out of symbols inside multiline comments. Building string constants by concatenating each character on a new line also helps a lot. Another great way to cheat the “lines of code” metric is to rewrite your code several times, producing 5 commits instead of 1. In code review, nobody really pays attention to the number of commits - reviewers look at the changes for the whole branch - so you can most likely get away with it.
Always think about how a given KPI can be corrupted, because, most likely, it will be.
Together with “increase KPI A by xx%”, your goals should always include “while maintaining or improving KPI B and KPI C”. This makes it harder to make mistakes or cut corners, and it motivates your staff to think harder about what the KPI really measures.
For example: engineering productivity based on the amount of code, balanced by churn, cyclomatic complexity, and other quality metrics.
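To make the pairing explicit, here is a tiny sketch of a goal check in which the main KPI must improve while its guard KPIs must not degrade (the names, deltas, and threshold are all hypothetical):

def goal_met(kpi_a_delta_pct: float, guard_deltas_pct: dict[str, float],
             target_pct: float = 10.0) -> bool:
    """KPI A must grow by at least target_pct while no guard KPI gets worse."""
    return kpi_a_delta_pct >= target_pct and all(delta >= 0.0 for delta in guard_deltas_pct.values())

print(goal_met(12.0, {"defect containment": 1.5, "first time right": -0.5}))  # False: a guard KPI degraded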
While using metrics to establish and track KPIs is a very good move, metrics truly shine when it comes to overall vigilance about your processes. Certain combinations of metrics can give you clear signals of imbalance, negligence, or even incompetence. If you know how to use metrics in the right way, you can tell a lot just by looking at a team’s burn-up chart or velocity diagrams. You can then verify your hypothesis with more specific metrics such as “Requirements Rewritten After Became Ready”, “First Time Right”, “Defect Containment”, and others.
PERF (the metrics toolset mentioned above) contains a “medical encyclopedia” of common project issues and how to spot them, as well as comprehensive advice on how to fix them and see the result through the improvement of metrics and KPIs.
Let’s say we look at the following metrics to see these signals:
Defect Containment – the internal team misses a significant number of valid issues, which are then caught by an external team during product acceptance, or upon release / in production
Bug Growth – issues logged during an active iteration are not fixed right away and drift into the product’s quality debt
Open Bugs Over Time by Priority – top-priority issues don’t get enough attention, i.e. the trend is not decreasing or the number of remaining open bugs stays high
Time in Status – verification phases (Code Review, Testing, Requirements Review, Refinement, etc.) are suspiciously quick
In that case we can say with high confidence that the team has no quality gates for phases such as new feature verification, regression, pre-production validation, and production smoke checks; a rough way to combine these signals into one flag is sketched below.
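All thresholds in the sketch are hypothetical and should be tuned per project:

def missing_quality_gates(defect_containment_pct: float,
                          bug_growth_per_iteration: float,
                          open_top_priority_bugs_trend: float,
                          avg_hours_in_verification_status: float) -> bool:
    """Raise the flag when several independent warning signals agree."""
    signals = [
        defect_containment_pct < 70.0,            # too many escapes caught outside the team
        bug_growth_per_iteration > 0.0,           # quality debt keeps growing
        open_top_priority_bugs_trend >= 0.0,      # top-priority bugs are not being burned down
        avg_hours_in_verification_status < 1.0,   # verification phases are suspiciously quick
    ]
    return sum(signals) >= 3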
Missing quality gates usually result in several issues at once:
Churn of software users
Failures in the production environment
Inability to maintain and add new features to a product
As a business, we cannot let this happen, because software products are our most important assets. If you were in real estate, you wouldn’t let a building keep leaking or rotting. Problems that seem small at first compound over time and eventually lead to the loss of the asset.
By using PERF and its complementary modules, EPAM managers are able to spot problems at very early stages and fix them, so that software development, delivery, and maintenance stay flawless and emotionally satisfying.
As you might have seen, personal metrics need a different approach than project metrics - but what if we go a level higher, to organizational metrics? The approach differs even more.
People are different, so people metrics need to be normalized to be useful for a project. The same applies to organizations. For example, you cannot aggregate a metric like velocity in story points across the company as-is, because each project has its own story points; they are all different. It’s like ignoring the difference between miles and kilometers when measuring a route from Edinburgh to Berlin.
If the difference in teams’ units of measure is predictable and depends on a trait of the team (for example, logged hours heavily depend on team size in FTEs), then that trait usually ends up in the denominator of the metric before aggregation. If we don’t do that, the average or median of logged hours per team doesn’t make much sense in most cases.
If it is not predictable (like story points - a team can choose whatever it wants as 1 SP), then we should normalize the metric values against the team’s own baseline. For example, averaging velocity in story points across the organization as a high-level KPI makes no sense, while an average velocity trend - the average of how well each team performs compared to that team’s best performance ever - has statistical significance and gives the PMO a useful signal.
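Both normalization ideas fit into a few lines; the teams and numbers below are made up for illustration:

from statistics import median

def logged_hours_per_fte(logged_hours: float, team_fte: float) -> float:
    return logged_hours / team_fte           # the predictable trait goes into the denominator

def velocity_trend(sprint_velocities: list[float]) -> float:
    """Last sprint's velocity as a fraction of this team's best sprint ever."""
    return sprint_velocities[-1] / max(sprint_velocities)

teams = {
    "alpha": [20, 25, 30, 27],   # story points per sprint, team-specific units
    "beta":  [80, 95, 110, 90],
}
org_trend = median(velocity_trend(v) for v in teams.values())
print(round(org_trend, 2))       # comparable across teams despite different story-point scales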
The examples above let us genuinely trust these “averages” and “medians” across the company, but to quickly find the problematic parts and teams and send help, we need a more advanced approach. Hit us up to find out how to do this using the TelescopeAI PERF toolset!