How we measure Engineering Productivity

Samuel Akinwunmi

Founder of Bilanc


How do we measure productivity?

In engineering, there is no perfect measure for productivity. There are plenty of articles that discuss why traditional productivity metrics are flawed. Instead, we use LLMs to estimate the amount of effort required for every merged PR.

When a PR is merged, we run an ephemeral workflow to index it. We calculate metrics (e.g. Cycle Time) and use an LLM-based binary classifier to retrieve relevant code from the repository, using metadata from previously indexed merged PRs as context. We also pull information on the associated tasks from JIRA & Linear.
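
For illustration, here is a minimal sketch of what a binary relevance classifier can look like. The `call_llm` helper, the prompt, and the metadata fields are assumptions for the sketch, not the exact implementation.

# Hypothetical sketch of a binary relevance classifier; `call_llm`, the prompt,
# and the metadata fields are illustrative stand-ins.
from typing import Callable, Dict, List

def classify_relevance(call_llm: Callable[[str], str],
                       pr_metadata: Dict,
                       candidate: Dict) -> bool:
    # Ask the LLM a yes/no question: is this repository chunk relevant to the PR?
    prompt = (
        "Decide whether this piece of repository code is relevant context "
        "for understanding a merged pull request.\n\n"
        f"PR title: {pr_metadata['title']}\n"
        f"Files changed: {', '.join(pr_metadata['files_changed'])}\n\n"
        f"Candidate file: {candidate['path']}\n"
        f"Candidate summary: {candidate['summary']}\n\n"
        "Answer with a single word: YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")

def retrieve_relevant_code(call_llm, pr_metadata, candidates: List[Dict]) -> List[Dict]:
    # Keep only the chunks the classifier marks as relevant.
    return [c for c in candidates if classify_relevance(call_llm, pr_metadata, c)]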

We pass all of this context to several smaller agents: a PR summarising agent, a PR tagging & categorising agent, and an agent that identifies risks & concerns in the PR. Their outputs are fed into our PR effort estimation agent, which calculates a productivity score between 0 and 100. We store the LLM's reasoning alongside each PR ID, which gives our users an understanding of why the LLM made certain comments or gave a PR a particular productivity score.
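
As a rough sketch, the estimation output might be stored in a shape like the one below. The field names and the `db.insert` call are hypothetical; the point is that the reasoning is persisted next to the score.

# Illustrative output shape; field names and the `db.insert` call are assumptions.
from dataclasses import dataclass, asdict

@dataclass
class EffortEstimate:
    pr_id: int
    productivity_score: int   # 0 - 100, produced by the effort estimation agent
    reasoning: str            # the LLM's explanation, stored alongside the PR ID
    tags: list
    risks: list

def store_estimate(db, estimate: EffortEstimate) -> None:
    # Persist the score together with the reasoning so users can see why it was given.
    if not 0 <= estimate.productivity_score <= 100:
        raise ValueError("productivity_score must be between 0 and 100")
    db.insert("pr_effort_estimates", asdict(estimate))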

There is no perfect measure for productivity, but we pride ourselves on being as close as possible to the “reality on the ground”.

A quote from one of our customers :)

How do we validate the results from the LLM?

We generate eval data manually in-house: pairs of PRs with Summaries, Tags, and Productivity Scores, all annotated by hand. This is tedious, so we combine it with LLM-as-a-Judge evals: we prompt an LLM to judge the results of the more deterministic agent tasks (e.g. Code Categorisation) and keep iterating on prompts until we pass a certain threshold. This helps us move a lot faster.
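
A minimal sketch of that loop, assuming a `call_llm` helper and a hand-annotated dataset; the judge prompt, dataset shape, and 0.9 threshold are illustrative, not our exact values.

# LLM-as-a-Judge eval loop (sketch). The judge prompt, dataset shape, and
# 0.9 threshold are illustrative assumptions.
from typing import Callable, Dict, List

JUDGE_PROMPT = (
    "You are grading an automated code-categorisation agent.\n"
    "Ground truth (human annotation): {expected}\n"
    "Agent output: {actual}\n"
    "Reply PASS if the agent output matches the ground truth in meaning, otherwise FAIL."
)

def run_eval(call_llm: Callable[[str], str],
             agent: Callable[[Dict], str],
             dataset: List[Dict],
             threshold: float = 0.9) -> bool:
    passes = 0
    for example in dataset:   # [{"pr": ..., "expected": ...}, ...]
        actual = agent(example["pr"])
        verdict = call_llm(JUDGE_PROMPT.format(expected=example["expected"], actual=actual))
        passes += verdict.strip().upper().startswith("PASS")
    pass_rate = passes / len(dataset)
    return pass_rate >= threshold   # iterate on the prompt until this passes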

This all used to be one big prompt, and one of the benefits of breaking it into smaller agents that work asynchronously is that we can add, remove, and continuously measure and optimise each agent independently. This means we can test new metrics really, really quickly.
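
For example, the fan-out across async agents can look like the sketch below. The stub agents are placeholders so the example runs end-to-end; the real agents are LLM calls.

# Sketch of independent agents fanned out concurrently. The stub agents are
# placeholders so the example runs; they are not the real implementations.
import asyncio

async def code_summary_agent(ctx): return "summary"
async def tagging_agent(ctx): return ["refactor"]
async def risk_assessment_agent(ctx): return []
async def effort_estimation_agent(ctx, summary, tags, risks): return {"score": 0, "reasoning": "..."}

async def run_agents(context):
    # Independent agents run concurrently and can be added, removed, or measured in isolation.
    summary, tags, risks = await asyncio.gather(
        code_summary_agent(context),
        tagging_agent(context),
        risk_assessment_agent(context),
    )
    # The effort estimation agent consumes the other agents' outputs.
    effort = await effort_estimation_agent(context, summary, tags, risks)
    return {"summary": summary, "tags": tags, "risks": risks, "effort_estimate": effort}

print(asyncio.run(run_agents({"pr_id": 12345})))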

How can we improve?

  • Deeper codebase understanding

    • At the moment we use a lot of custom context management to build a good understanding of each Pull Request. We chunk files and store LLM-generated summaries to help our classifier with retrieval (a rough sketch of this follows after this list). We’re exploring upgrading this to graph search and including static analysis (ASTs) to improve relevance.

  • Improving Dataset generation for evals

    • All of the ground truth data we use for evals comes from internal sources (our own annotations, or LLM-generated). Even though we have over 1,000 PRs annotated or judged by an LLM, we know this isn’t perfect. We’re looking to explore outsourced dataset generation for some of these tasks. We’re slightly skeptical of the generalisability, but we think it’s a step change above using our own data, and we’re also excited about the potential dev-velocity gains.
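
On the first point above, here is a rough sketch of the chunk-and-summarise indexing, with fixed-size line chunks standing in for whatever splitting is actually used; `call_llm` and the summary prompt are assumptions.

# Chunk-and-summarise indexing (sketch). Chunk size, the summary prompt, and
# `call_llm` are illustrative assumptions.
from typing import Callable, Dict, List

def chunk_file(path: str, text: str, max_lines: int = 80) -> List[Dict]:
    # Split a file into fixed-size line chunks (a smarter splitter might follow AST nodes).
    lines = text.splitlines()
    return [
        {"path": path, "start": i + 1, "code": "\n".join(lines[i:i + max_lines])}
        for i in range(0, len(lines), max_lines)
    ]

def index_repository(call_llm: Callable[[str], str], files: Dict[str, str]) -> List[Dict]:
    # Store an LLM-generated summary per chunk; the relevance classifier retrieves against these.
    index = []
    for path, text in files.items():
        for chunk in chunk_file(path, text):
            chunk["summary"] = call_llm(
                f"Summarise what this code does in two sentences:\n\n{chunk['code']}"
            )
            index.append(chunk)
    return index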

Appendix

# Pseudocode for Measuring Productivity on Merged PRs

class ProductivityWorkflow:
    def __init__(self, pr_id):
        self.pr_id = pr_id

    def run_indexing(self):
        # Index the merged PR and compute metrics such as Cycle Time.
        pr_index = index_pr(self.pr_id)
        metrics = calculate_metrics(pr_index)
        return pr_index, metrics

    def retrieve_relevant_code(self, pr_index):
        # Binary LLM classifier: keep only repository code relevant to this PR,
        # using metadata from previously indexed merged PRs as context.
        relevant_code = LLMClassifier(pr_index.metadata, historical_pr_data())
        return relevant_code

    def pull_external_context(self):
        # Pull task information from JIRA & Linear.
        jira_context = fetch_jira_context(self.pr_id)
        linear_context = fetch_linear_context(self.pr_id)
        return merge_context(jira_context, linear_context)

    def run_code_summarization_pipeline(self, context):
        # Smaller agents: summarisation, tagging & categorisation, risk assessment.
        code_summary = code_summary_agent(context)
        pr_tags = tagging_agent(context)
        architectural_risks = risk_assessment_agent(context)
        # The effort estimation agent takes the other agents' outputs and
        # produces a 0-100 productivity score plus the LLM's reasoning.
        effort_estimate = effort_estimation_agent(
            context, code_summary, pr_tags, architectural_risks
        )

        return {
            "summary": code_summary,
            "tags": pr_tags,
            "architectural_risks": architectural_risks,
            "effort_estimate": effort_estimate,
        }

    def store_results(self, metrics, ai_outputs):
        # Persist metrics and LLM outputs (including reasoning) against the PR ID.
        store_in_database(self.pr_id, {
            "metrics": metrics,
            "ai_outputs": ai_outputs,
        })

    def process_pr(self):
        pr_index, metrics = self.run_indexing()
        relevant_code = self.retrieve_relevant_code(pr_index)
        external_context = self.pull_external_context()

        context = merge_context(pr_index, relevant_code, external_context)
        ai_outputs = self.run_code_summarization_pipeline(context)
        self.store_results(metrics, ai_outputs)

# Example usage:
workflow = ProductivityWorkflow(pr_id=12345)
workflow.process_pr()