Need Agility? Buy Programmability!


Which of these situations do you recognize?

  1. All the dials are in the red. You realize you are under DDoS attack. Wonderful that your friends over at the firewall company are willing to give you all the licenses you need, including all the latest features, and they will even help you operate the defense. All you need to do now is spin up a line of virtual firewalls. That was easy. And reroute all the traffic through them. A little harder. There are hundreds of devices in the network that have to be reconfigured, and it needs to be done right. And preferably rather soon.
  2. The phone call was from our newest financial customer. Not the biggest deal ever, but clearly an important customer. Now they had changed their mind about how two of their Japanese branch offices would be connected to a broker in Singapore. The deployment order is already halfway through, with some truck rolls on the way, some virtual equipment already spun up in our data center, ports, addresses, and paths already allocated. The easiest way would be to cancel the order and make a new one, but that would surely delay the final delivery date.
  3. The guy was right; our core network was in the second half of its economic life. Replacing it with his new gear, which also takes care of all our network security needs, at a lower cost than our current equipment vendor’s latest line, is definitely tempting. But imagine all the work we’d need to invest in retooling our management infra. For a while, we’d even need to support two vendors in a single network. A dream situation turned into a nightmare?
  4. It’s all over the news in Germany. It’s all over the news in Sweden, Italy, and Australia. Lists of which private cloud providers are using green vs. dirty power. And all the major corporations who are their customers. It’s now only a question of time, and not much of it, before we’d end up in those lists. Or so said our COO, at least. “How soon can we stop using that cloud provider?” Well, it’s only all our databases that are running there. A small matter of connectivity. How hard can it be?
  5. We should have known better. To keep adding customers without building out the infrastructure was bound to make the system hit limits at some point. That service provisioning failed occasionally was not really a critical issue. Hard to eliminate completely, and it was only a small fraction anyway. The real issue was that every time it failed, it didn’t clean up properly. Over time, more and more resources, such as IPs, remained allocated to no service. Ports connected to nobody were left open in the ACLs. Not sure how the hackers found out, but our logs confirm a full database copy, destination Tor. Who should I tell first?
  6. Port Hedland. Never heard of it before. Apparently, that’s where our transoceanic fiber from Singapore comes ashore in Australia. With the massive fires going on there now, it will surely take weeks before power and civilization return to normal. Poor locals. Anyway, all our Australian customers’ services need to be temporarily remapped to tunnel over some other carrier. It’s pretty easy to update our provisioning script to do this for new customers, but existing customers, with all their variations in deployed services, will not be easy. We could unprovision and reprovision all of them now. The customers are currently down anyway, so they wouldn’t mind. But it would clearly not be popular to subject them to another outage in a few weeks, as we switch them back to our fiber.
  7. Wow. 68%. That explains quite a bit. That was the auditor’s assessment of the correctness of our database vs. what services were actually running in the network. No wonder so many provisioning requests failed. And worse, some went through but did unexpected things instead. Finally, we have a corporate decision to reconcile our database with the network and make sure it stays up to date. Correctness in our data is a necessity for keeping customers and working efficiently.

Did you spot the common theme in all these scenarios? The need for business agility. The ability to work with the unexpected. The possibility to get to a better place.

If you have ever been involved in automating anything with a computer, you know it’s easy to script a sequence of actions, a workflow, and then put it on repeat. Your robot has no eyes, but as long as the environment stays as expected, your robot can be very productive. Need to remove what the robot created? No problem, just build another robot to do that. Need another service? Just add two robots. Need to switch customers between your two service types? Just add a robot for modifying the aspects of the service that are different, and one more for going in the reverse direction. Oh, the service modification may fail at points B, C, or D? Just add robots that can take the service from each of those intermediate states back to the original state. Oh, you’re under DDoS attack and need to insert a firewall into the path? You have 86 robots to update. Let’s stop talking; you have a lot of work to do.

There is nothing wrong with workflows. We all need them, use them, and love them. Well, maybe the love part has its ups and downs. Anyway, the problem with workflows is that they are great at a high level, so people like to create them. But the deeper into the details you get, the more problematic the workflow approach becomes. The more details we get into, the more additional steps and dependencies in the workflow there are to keep track of. And the more arrows going between those steps in your workflow diagram, the more robots there are to deal with. There is even a word for this: state explosion. Each arrow corresponds to some robot action that you need to program, maintain, and test. And that may need to be updated “preferably rather soon” at times.

Long term, the worst thing about workflow state transitions is that they can only handle the expected. If something unexpected ever happens, the robots have no eyes and no logic to understand the bigger picture of what’s going on. You can of course always keep adding, trying to foresee more and more possible states and failure conditions. Surely some of your more creative customers will be helping you with this. Still, your robots won’t be able to handle anything you didn’t already think of. In a traditional, mature, workflow-centered network management system, a typical situation is that around half the code is for detecting and recovering from errors. The cost of testing this code alone makes me shudder.

To get to solid ground, we need to bring in the computer scientists.

One observation the computer scientists have made is that low-level, detailed instructions are only preferable when the manager is a greater expert than the worker in the worker’s field of expertise. Further observation shows this is rarely the case. In general, the manager attains better results by communicating goals rather than sequences of instructions. Computer scientists call this way of working declarative. You declare what you want the end result to look like, but leave open how to get there. The instruction-sequence approach is known as micromanagement.
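
To make the contrast concrete, here is a minimal Python sketch. Everything in it is invented for illustration (device and interface names, the shape of the data); the point is only that the imperative version spells out every step, while the declarative version states the end result and leaves the “how” to an engine.

```python
# Imperative / micromanagement: the caller spells out every step, in order.
imperative_steps = [
    ("create_vlan",       {"device": "edge-1", "vlan": 204}),
    ("add_vlan_to_trunk", {"device": "edge-1", "port": "ge-0/0/1", "vlan": 204}),
    ("set_description",   {"device": "edge-1", "port": "ge-0/0/1", "text": "cust-204"}),
]

# Declarative: the caller only states what the end result should look like.
desired_state = {
    "edge-1": {
        "vlans": {204: {"name": "cust-204"}},
        "ports": {"ge-0/0/1": {"trunk_vlans": [204], "description": "cust-204"}},
    }
}
# An engine (not shown here) compares desired_state with the actual device
# state and works out the necessary steps by itself.
```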

The other computer science term we need is transaction. A transaction is an abstract concept defined to be atomic, consistent, isolated, and durable (ACID). The point of a transaction is that it gives its wielder some very nice guarantees. The most well-known property is that the transaction either does all you asked for, or nothing. Never anything in between. Combine this with a workflow, and suddenly we have something very attractive. A sequence of steps where each step either happens, or it doesn’t. No code to clean up some mess that fell out in between. This keeps the state explosion away, and it can handle unexpected, unforeseen situations very well.
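
As a rough illustration, and not any particular product’s API, here is a Python sketch of the all-or-nothing property: every change carries its own undo action, and if any step fails, the steps already applied are reverted so the system ends up exactly where it started.

```python
def apply_transaction(changes):
    """Apply a list of (do, undo) callables atomically: all stick, or none do."""
    applied = []
    try:
        for do, undo in changes:
            do()
            applied.append(undo)
    except Exception:
        # Roll back in reverse order so the system returns to its original state.
        for undo in reversed(applied):
            undo()
        raise  # the caller sees a clean failure, never a half-done change


# Hypothetical usage: insert a firewall and reroute traffic as one unit.
state = {"firewall": False, "route": "direct"}

changes = [
    (lambda: state.update(firewall=True),  lambda: state.update(firewall=False)),
    (lambda: state.update(route="via-fw"), lambda: state.update(route="direct")),
]
apply_transaction(changes)  # either both changes land, or neither does
```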

The final computer science term to include is the data model. Data models describe everything a client needs to know about an interface. In order not to grow astronomical in size, data model-driven interfaces usually have a small number of operations (verbs) that can be applied to and combined across a smaller or larger set of objects (nouns). When an interface is data model-driven, it means that a client can use the interface efficiently based on the data model alone. It has all the information needed to work the interface. This enables transaction engines to figure out how to go from one state to another without requiring a human programmer with domain knowledge to split that up into steps.
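
Here is a toy Python sketch of that idea (the services and attributes are made up): because both the current and the desired configuration follow the same model, a completely generic engine can compute the create/modify/delete operations without any service-specific code.

```python
def compute_diff(current, desired):
    """Generic diff between two model-shaped dicts: what to create, modify, delete."""
    ops = []
    for key in desired.keys() - current.keys():
        ops.append(("create", key, desired[key]))
    for key in desired.keys() & current.keys():
        if desired[key] != current[key]:
            ops.append(("modify", key, desired[key]))
    for key in current.keys() - desired.keys():
        ops.append(("delete", key, None))
    return ops


# Hypothetical example: remap one customer's service to a backup carrier.
current = {"svc-1": {"carrier": "ocean-fiber"}, "svc-2": {"carrier": "ocean-fiber"}}
desired = {"svc-1": {"carrier": "backup-sat"},  "svc-2": {"carrier": "ocean-fiber"}}

print(compute_diff(current, desired))  # [('modify', 'svc-1', {'carrier': 'backup-sat'})]
```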

While some people think programmability happens as soon as you add a (RESTful) API to your software, we believe it is only meaningful to talk about programmability when the API delivers business agility. Agility that does not depend on quick-fingered humans to stay agile. APIs that are solid enough to be reasoned about, consistent enough to be computed with, and precise enough to be depended upon. Programmability means declarative, transactional, and data model-driven APIs. A name comes to mind: NETCONF/YANG.
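
To give a flavor of what that can look like in practice, here is a hedged sketch using the open-source ncclient Python library, assuming a device that speaks NETCONF and supports the candidate datastore. The host, credentials, and the YANG-modeled payload (ietf-interfaces here) are placeholders, not a recipe for any specific device: the client declares the desired configuration into the candidate datastore, and the commit either applies all of it or none of it.

```python
from ncclient import manager  # pip install ncclient

# Desired configuration, expressed against a YANG data model
# (ietf-interfaces here; adjust to whatever models your device advertises).
DESIRED = """
<config xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
    <interface>
      <name>ge-0/0/1</name>
      <description>cust-204 uplink</description>
      <enabled>true</enabled>
    </interface>
  </interfaces>
</config>
"""

with manager.connect(host="192.0.2.1", port=830,          # placeholder device
                     username="admin", password="admin",  # placeholder credentials
                     hostkey_verify=False) as m:
    try:
        m.edit_config(target="candidate", config=DESIRED)  # declare the end state
        m.validate(source="candidate")                      # let the device check it
        m.commit()                                          # all of it takes effect...
    except Exception:
        m.discard_changes()                                 # ...or none of it does
        raise
```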
