How to Solve the API Variant Problem (Part 2)

Net API Notes for 2024/03/15, Issue 234

Fiscal belt-tightening and future uncertainty have got accounting and technical leadership taking a refreshed look at every API expenditure. And when it comes to targets, variant APIs - or two or more APIs created by different teams that appear to largely serve the same purpose - are in the cross-hairs. 

In the first installment of this series, I illustrated the API variant problem with examples and then detailed why it is so incredibly common in mid-to-large enterprise environments. In this second part, I'll explain why most efforts to solve the variant problem are prone to failure. I'll also lay out a lightweight approach for sizing up the variants in your ecosystem - one that is more likely to achieve the ends you're hoping for.

That, and more, after the jump.

A quartet of Loki variants from Season 1 of the Disney+ MCU show, Loki.

Enterprise-Wide Decrees are Blunt Instruments

Leadership may perceive that API variants increase development time, reduce developer productivity, and increase the odds that mistakes will happen. Maintaining a laissez-faire approach (that is, allowing development teams to continue doing whatever they want, including making additional variants) is not a scalable strategy. Surprisingly, however, neither is mandating that all variants must be pruned for the greater good - especially if they already have in-production clients.

Consider that reconciling multiple versions of an API may be resource-intensive. Everything from diff'ing the interfaces to tracing the data back to their respective sources to ensure a true apples-to-apples comparison takes time, often from some of the busiest subject matter experts. 

It can also be incredibly disruptive. Picture it - you're a team lead, and you've got a backlog as long as your arm; management needs to see tangible movement on strategic Northstars A through E, and for whatever reason, the service has a 10% chance of inexplicably falling over every other Tuesday at 3am. And then someone comes calling, asking you to halt everything? In the best case, so that you can take on another team's problems? Or, in the worst case, so that they can take your project, leaving your team in limbo?

Finally, while there may be long-term benefits to a reconciled API, the short-term disruption is hard to justify to clients. Suddenly, the internal implementation details are exposed, defeating the point of abstracting them away. An API consumer is right to ask, "You're going to stop new feature development and introduce a tremendous amount of risk… to offer the same thing you did before? Because you've got - if not technical - conceptual debt?"

There may be cases where a senior enterprise architect, engineering director, or technical product VP has the hierarchical power to decree that all variants will be pruned, and the organization obeys. However, anything less than lock-step adherence will only add to the swirl, proving to be a costly distraction.

There are cases where the increased ecosystem complexity, redundant maintenance, and fragmented user experience should be addressed. But, if we are to be selective, how do we identify cases where the cost is justified? 

An ABC Approach to Sizing the Variant Problem

Activity-Based Costing (ABC) is a model for totaling the costs of overhead activities and assigning those costs to services. For those who want to do a deep dive, there can be a lot to it. However, given what we're trying to accomplish here, I'm going to borrow the useful bits and leave the rest as an exercise for any CPAs who might be following along.

"All models are wrong, but some are useful." - George Box

Before we begin using a modified ABC method to size our variant problem, we need to review a few basics. First, the ABC model has four levels:

  • Resources - The expenses associated with operating the API. This includes costs like staff, cloud bills, and licensing costs for monitoring, gateways, etc. They are nouns or the things doing work.
  • Activities - All operational and infrastructure tasks within the IT department used to support the API, such as documentation writing and maintenance, performance optimization, patching security vulnerabilities, and rehydration of cloud instances, among other things. These are verbs, or the things done.
  • Services - The APIs made available by the team, regardless of whether they're traditional REST-ish, RPC, GraphQL, or something else. If the team also provides things like SDKs or libraries, they would also go here. These are the things offered to clients.
  • Clients - All the consumers of the APIs, regardless of whether they are internal or external. These are the reasons we made the thing.

The next thing to know about ABC is that clients consume services, services consume activities, and activities consume resources. Squint, and you may just see how well this ABC approach aligns with an API's "outside-in" prioritization.

Finally, starting by stating costs can seem like a gigantic task. To make it easier to unravel that giant hairball, let's start with something known and easily quantifiable, like clients, and work our way backward into greater levels of specificity until we have enough fidelity to make a confident decision.

Fig 1. An Illustration of the ABC model's four main groupings and their relation to each other.
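
For readers who think better in code than in diagrams, the consumption chain can be sketched as a few nested Python records. This is only an illustration: the class names and the `monthly_hours` helper are hypothetical, labor hours stand in for the resource level, and the figures are borrowed from External Partner X in the worked example later in this issue. Grouping those activities under the SDK and White-Glove services is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Activity:
    """A verb: something done, which consumes a resource (here, labor hours)."""
    name: str
    hours_per_month: float


@dataclass
class Service:
    """Something offered to clients; it consumes activities."""
    name: str
    activities: List[Activity] = field(default_factory=list)


@dataclass
class Client:
    """A consumer of services; the reason we made the thing."""
    name: str
    services: List[Service] = field(default_factory=list)


def monthly_hours(client: Client) -> float:
    """Walk client -> services -> activities and total the labor consumed."""
    return sum(a.hours_per_month for s in client.services for a in s.activities)


# Illustrative data; the activity-to-service grouping is an assumption.
partner_x = Client("External Partner X", [
    Service("SDK", [
        Activity("Code & Security Updates", 16),
        Activity("Compatibility Testing", 16),
    ]),
    Service("White-Glove Service", [
        Activity("Metrics Discovery & Reporting", 8),
        Activity("Strategic Planning Support", 4),
    ]),
])

print(monthly_hours(partner_x))  # 44
```

Starting from the client and walking inward mirrors the "work our way backward" advice: the known thing sits at the top, and detail is added only as far as the decision requires.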

Using the ABC Model To Determine How Long to Recoup Consolidation Costs

To begin using this model, we first want to derive a ballpark idea of what each variant costs to maintain, independently. We start by listing the clients for each in a table beneath each variant's name. 

Variant API A                         Variant API B
----------------------------------    ---------------------------------------------
Internal Team 1                       External Partner X
Internal Team 2                       Internal Team Y
                                      Internal Team Z

Next, we move to services. With each step, we add additional detail. For each client, we want to capture which services are used by which API.

Variant API A                         Variant API B
----------------------------------    ---------------------------------------------
Internal Team 1                       External Partner X
  API requests: <10 TPS                 SDK
Internal Team 2                         White-Glove Service
  API requests: ~50 TPS                 API requests: ~100 TPS
                                      Internal Team Y
                                        API requests: <10 TPS
                                      Internal Team Z
                                        API requests: <10 TPS

Already, we can begin to see some interesting things about this hypothetical example. For one thing, Variant B has some significant usage. Moreover, if that variant is supporting some form of White-Glove service, there are most likely contractual elements in play that we should be very wary of disrupting.

That said, we still need to go further to have a clearer idea of the ongoing costs associated with each variant. Here, we want to list not only the activities required to support each service but also the approximate hours. These don't have to be exact hours, but they should be within an order of magnitude.

Variant API A                         Variant API B
----------------------------------    ---------------------------------------------
Internal Team 1                       External Partner X
  API requests: <10 TPS                 SDK
Internal Team 2                           Code & Security Updates: 16h / month
  API requests: ~50 TPS                   Compatibility Testing: 16h / month
                                        White-Glove Service
                                          Metrics Discovery & Reporting: 8h / month
                                          Strategic Planning Support: 4h / month
                                        API requests: ~100 TPS
                                      Internal Team Y
                                        API requests: <10 TPS
                                      Internal Team Z
                                        API requests: <10 TPS

Servicing All Client Requests
  Patching: 8h / month                  Patching: 16h / month
  Documentation: 4h / month             Documentation: 16h / month
  Monitoring Analysis: 8h / month       Monitoring Analysis: 16h / month

At this point, your spreadsheet might have many more rows. In our final step, we'll convert the hours spent on activities into a cost. We'll also add in fixed costs (cloud hosting, licenses) where known. Talk to your resident budgeting and forecasting expert to discover what they use for a developer's hourly rate. If you don't know who that might be, (1) you should find out who that is and ask to have coffee, as they're good people to get to know before you need them, and (2) use something plausible to keep the math simple.

Variant API A                         Variant API B
----------------------------------    ---------------------------------------------
Internal Team 1                       External Partner X
  API requests: <10 TPS                 SDK
Internal Team 2                           Code & Security Updates: $1600
  API requests: ~50 TPS                   Compatibility Testing: $1600
                                        White-Glove Service
                                          Metrics Discovery & Reporting: $800
                                          Strategic Planning Support: $400
                                        API requests: ~100 TPS
                                      Internal Team Y
                                        API requests: <10 TPS
                                      Internal Team Z
                                        API requests: <10 TPS

Servicing All Client Requests
  Patching: $800                        Patching: $1600
  Documentation: $400                   Documentation: $1600
  Monitoring Analysis: $800             Monitoring Analysis: $1600

Non-Labor Costs (monthly)             Non-Labor Costs (monthly)
  Compute: $146                         Compute: $292
  Network: $45                          Network: $100
  Storage: $10                          Storage: $10

TOTAL: $2201                          TOTAL: $9602

Note: In this example, labor costs dwarf the non-labor costs by quite a bit, even while - most likely - underpricing a developer's hourly rate at $100/hr to make the math simpler. However, I fully acknowledge there may be several additional items (tools, licenses, Auth providers, etc.) that I'm not accounting for here. 
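
If you'd rather script this step than maintain it in a spreadsheet, here is a minimal Python sketch of the same arithmetic. The hours and non-labor figures come straight from the tables above, and the $100/hr rate is the placeholder from the note; the dictionary layout and the `monthly_cost` function name are my own inventions for illustration.

```python
HOURLY_RATE = 100  # USD per hour; the deliberately simple placeholder rate

# Monthly labor hours per activity, copied from the hours table.
labor_hours = {
    "Variant API A": {
        "Patching": 8,
        "Documentation": 4,
        "Monitoring Analysis": 8,
    },
    "Variant API B": {
        "Code & Security Updates": 16,
        "Compatibility Testing": 16,
        "Metrics Discovery & Reporting": 8,
        "Strategic Planning Support": 4,
        "Patching": 16,
        "Documentation": 16,
        "Monitoring Analysis": 16,
    },
}

# Monthly non-labor costs in USD, copied from the cost table.
non_labor = {
    "Variant API A": {"Compute": 146, "Network": 45, "Storage": 10},
    "Variant API B": {"Compute": 292, "Network": 100, "Storage": 10},
}


def monthly_cost(variant, rate=HOURLY_RATE):
    """Return (labor, non_labor, total) in USD per month for one variant."""
    labor = sum(labor_hours[variant].values()) * rate
    fixed = sum(non_labor[variant].values())
    return labor, fixed, labor + fixed


for name in labor_hours:
    labor, fixed, total = monthly_cost(name)
    print(f"{name}: labor ${labor}, non-labor ${fixed}, total ${total}")
```

Swapping in the rate your budgeting expert actually uses is a one-line change, which makes it easy to see how sensitive the totals are to that assumption.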

Determining Time to Recoup Costs

Comparing the columns for Variant A and Variant B, we may be inclined to suggest that users of API A migrate to API B. It already supports other internal teams, has additional support capabilities due to the external partner's usage, and would seem to be able to scale to support API A's TPS.

In reality, we would hope to see some economies of scale from having API B handle the increased workload. But since that's a bit handwavy, let's model API B adding API A's Non-Labor Costs to its monthly operations. Our cost savings from consolidating the two variants into a single, canonical API then become the monthly labor associated with API A, or $2000 per month.

Variant API A                         Variant API B
----------------------------------    ---------------------------------------------
Servicing All Client Requests
  Patching: $800                        Patching: $1600
  Documentation: $400                   Documentation: $1600
  Monitoring Analysis: $800             Monitoring Analysis: $1600

Non-Labor Costs (monthly)             Non-Labor Costs (monthly)
  Compute:                              Compute: $438 ($292 + $146)
  Network:                              Network: $145 ($100 + $45)
  Storage:                              Storage: $20 ($10 + $10)
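
The consolidation assumption above can be written down in a few lines of Python. This is a sketch under the same simplification used in the text: API B absorbs API A's non-labor costs one-for-one, with no economies of scale, and API A's ongoing labor disappears. The `consolidate` function and the dictionary shape are hypothetical.

```python
# Monthly figures in USD, taken from the cost tables above.
api_a = {"labor": 2000, "non_labor": {"Compute": 146, "Network": 45, "Storage": 10}}
api_b = {"labor": 9200, "non_labor": {"Compute": 292, "Network": 100, "Storage": 10}}


def consolidate(keep, retire):
    """Fold the retired variant's non-labor costs into the kept variant.

    Returns (merged_non_labor, monthly_savings). Assumes no economies of
    scale and that the retired variant's labor cost goes to zero.
    """
    items = sorted(set(keep["non_labor"]) | set(retire["non_labor"]))
    merged = {
        item: keep["non_labor"].get(item, 0) + retire["non_labor"].get(item, 0)
        for item in items
    }
    before = (
        keep["labor"] + retire["labor"]
        + sum(keep["non_labor"].values())
        + sum(retire["non_labor"].values())
    )
    after = keep["labor"] + sum(merged.values())
    return merged, before - after


merged, savings = consolidate(keep=api_b, retire=api_a)
print(merged)   # {'Compute': 438, 'Network': 145, 'Storage': 20}
print(savings)  # 2000
```

Because the non-labor costs simply move from one column to the other, the savings reduce to API A's monthly labor, which is the $2000 figure used from here on.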

Assessing the Feasibility of Pruning Our Variant

Setting aside subjective costs like fragmented discovery experience, we have a hard number for what consolidation may net us in this one case, along with solid suggestions on which way work should flow. Next, we need to think through what it will take to prune variant A. 

A quick list of the activities involved with reconciling the two APIs into a single item would include:

  1. Analysis and documentation of endpoints, data models, security protocols, etc.
  2. Map current state to next state
  3. Where necessary, plan the interface updates to a future version of API B 
  4. Prepare the test strategy and documentation updates
  5. Create and communicate the Deprecation and Sunset strategy to API A's clients
  6. API A's Client teams will need to modify their code to consume the new API
  7. As the cutover date approaches, API B monitors API A usage and escalates when appropriate
  8. API A's Team follows established end-of-life procedures for API A

As you can imagine, this could be a fair amount of work with a fair amount of variation. However, using our ABC model calculation, we can better assess the feasibility of this course of action. Suppose our variations are minor and the teams need to invest only a minimal amount of time in those activities - say, 240 hours across all team members, or $24,000 (240 x $100). If that sounds like a lot, keep in mind that with an average of 22 eight-hour work days, a team of 8 devs represents approximately 1408 person-hours per month - and I guarantee that even a 'quick' consolidation will be lucky to be done in that time. Using our quick calculations, we see that a "canonical" API would break even in a year ($24,000 divided by $2000 / month = 12 months).

Now, suppose that analysis determines that there are actually some tricky edge cases here - maybe there's a different authentication scheme being used, or the use cases - despite using the same terminology - are actually pulling from data sources with vastly different refresh rates. The total hours, from accommodation to client modification, come in closer to 2400 - something that could easily represent a couple of months' work. Multiplying by our developer hourly rate, we get $240,000. Dividing by the $2000 monthly savings from our ABC model, we see it would take 120 months, or TEN YEARS, before we could claim the benefits of cost savings.

With numbers on that scale, we should have an honest conversation as to whether that is even practical for this API's operational life.  
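
The break-even arithmetic for both scenarios fits in a short Python sketch. The inputs are the figures used above ($100/hr, $2000 per month in savings, 240 or 2400 hours of effort); the function names are hypothetical, and an eight-hour work day is assumed for the team-capacity check.

```python
def break_even_months(effort_hours, hourly_rate, monthly_savings):
    """Months of savings needed to pay back a one-time consolidation effort."""
    return effort_hours * hourly_rate / monthly_savings


def team_hours_per_month(devs, work_days=22, hours_per_day=8):
    """Rough monthly capacity of a team, assuming eight-hour days."""
    return devs * work_days * hours_per_day


HOURLY_RATE = 100       # USD per hour, the placeholder rate
MONTHLY_SAVINGS = 2000  # USD per month, API A's labor cost

for label, hours in [("minor variations", 240), ("tricky edge cases", 2400)]:
    months = break_even_months(hours, HOURLY_RATE, MONTHLY_SAVINGS)
    print(f"{label}: ${hours * HOURLY_RATE:,} up front, {months:.0f} months to break even")

print(team_hours_per_month(devs=8))  # 1408
```

Comparing the result against the API's expected remaining operational life is the point of the exercise: a 12-month payback may be worth a conversation, while a 120-month one rarely is.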

Conclusion

I'm the first to admit that this approach is crude. For example, it doesn't account for the ROI that the team behind API A might generate after they're free to do other work. The example only consolidates two APIs; additional APIs would result in a higher ABC model number. There's an opportunity cost associated with doing this reconciliation work I didn't model. Oh, and the number of hours clients need to rewrite their code to accommodate a newer version of an API is pretty darn squishy. 

However, as economic conditions compel API leadership to justify expenses and find efficiencies, we need more systematic methods of engaging in system-wide change than what "feels" like a win. Managing via theoretical "vibes" is no way to make important decisions. With a few critical pieces of data, anyone could take an afternoon and get a sense of the feasibility of pruning their variant APIs. 

Of course, the easiest way to solve the problem is to prevent it from happening in the first place. Another option is to get targeted and ask whether there are ways to reduce the labor costs of the repetitive, ongoing support activities - now that we have the ABC list and see the activities common across the portfolio, is there a way to automate patching and cut monthly labor costs in half? If you're looking to improve efficiency, that might be an easier row to hoe; more impactful in a shorter amount of time, too. But both are notes for another time.

Wrapping Up

And now for something different! You might recall that back in January, I helped produce #APIFutures, a collaborative event challenging dozens of API folks from around the world to predict the largest opportunities (or challenges) for the coming year. Fast-forward and I'm now participating in the "Never Break the Chain" writing challenge, organized by architect, teacher, and author Diana Montalion. The goal is simple: WRITE EVERY DAY IN MARCH.

Whereas Diana and ThoughtWorks's Andrew Harmel-Law are using the event's peer pressure to make positive progress on their respective books (Learning Systems Thinking and Facilitating Software Architecture, respectively), I chose something oblique: re-interpreting Aesop's Fables as technology-related tales from the early 22nd solarpunk century. Why? 'Cause life is short and I need more experiences outside my comfort zone. Each workday in March, I'm posting an upcycled fable to my website. If you give these a gander, let me know. 

But that's something else. If you'd like to become a paying subscriber to THIS work, check out the subscription page. A monthly or yearly pledge helps Net API Notes remain ad-free and ensures previous commentary, analysis, and insights stay available for all. 

That's all for now. Till next time,

Matthew (@matthew in the fediverse and matthewreinbold.com on the web)
