Pets vs. Cattle Part II: Hint, This Time It Is the Applications

If you thought the Pets vs. Cattle saga was over in 2016, you'll be thrilled or disappointed, as the case may be, to hear that I'm going to resuscitate the thread, but in a different context.

Pets vs. Cattle Part I was all about the infrastructure. In the good old days, we used to give each server tender loving care (TLC), just as we would a pet. Operating system provisioning and updates, BIOS updates, BMC firmware updates, peripheral (e.g. NIC) installation and configuration, RAID configuration, logical volume management, remote KVM access, IPMI management, server debug, and more were all performed manually on a per-server basis. Applications would then be installed manually on a given server or set of servers. During the 2009-2016 timeframe, the cloud architecture completely standardized and automated the entire server management task. Server management was now akin to managing cattle, hence the term. Applications were no longer installed on a particular server. Instead, they were deployed on the "cloud", and the cloud layer (public cloud, OpenStack, or Kubernetes (K8s)) would take care of placing the specific VMs or containers onto individual server nodes, amongst other things.

However, applications have continued to be treated as pets. Each application receives TLC. Even on cattle infrastructure, i.e. a cloud framework, applications are installed onto an individual cloud using a declarative template such as Terraform, Helm Charts, or OpenStack Heat, and are then configured either manually or with tools such as Ansible. Service assurance, or fixing problems, has revolved around humans looking at application dashboards (also called an Element Management System or EMS) and alerts, and closing tickets manually.

Let's do a thought experiment to see how well a pets approach works for application management in the edge computing context. Let's assume 1,000 edge locations and 200 K8s applications per edge location. An application could be a network function such as a UPF, AMF, SMF, vFirewall, or SD-WAN; a cloud native application such as AR/VR, drone control, ad insertion, 360 video, or cloud gaming; AI/ML at the edge such as video surveillance or radiology anomaly detection; or an IoT/PaaS framework such as EdgeX Foundry, Azure IoT Edge, AWS IoT Greengrass, or XGVela. Furthermore, assume that the number of application instances goes up to 1,000 per edge site with 5G network slicing. This means there would be 1,000 x 1,000 = 1,000,000 application instances across all the edge sites in this example.
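To make the arithmetic in the thought experiment explicit, here is a minimal Python sketch. The EdgeScenario class and its field names are hypothetical, introduced only for illustration; the numbers are the ones assumed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeScenario:
    """Hypothetical container for the thought-experiment inputs."""
    sites: int               # number of edge locations
    apps_per_site: int       # distinct K8s applications per location
    instances_per_site: int  # application instances per location with 5G slicing

    def base_instances(self) -> int:
        # One instance of each application at every site (no slicing).
        return self.sites * self.apps_per_site

    def sliced_instances(self) -> int:
        # Total instances once network slicing multiplies them per site.
        return self.sites * self.instances_per_site


scenario = EdgeScenario(sites=1_000, apps_per_site=200, instances_per_site=1_000)
print(f"{scenario.base_instances():,}")    # 200,000
print(f"{scenario.sliced_instances():,}")  # 1,000,000
```

Even before slicing, the pets approach would already be facing 200,000 application instances; slicing takes that to a million.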

Here is the impact on application management:

Initial orchestration: The Ops team would have to edit 1,000,000 Helm Charts (to change Day 0 parameters), log into 1,000 K8s masters, and run a few million CLI commands. Clearly, this is not possible.

Ongoing lifecycle management: Log into 1,000,000 dashboards and manage the associated application instances (since very few application management dashboards manage multiple instances), OR run 200 Ansible scripts 5,000 times each with different parameters, which means executing scripts 1,000,000 times in total. This is not practical either.

Service assurance: Monitor 1,000,000 dashboards and fix issues based on tickets opened. This is also not feasible.
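The three items above can be tallied in a short Python sketch. The function names are hypothetical, and the five minutes per Helm Chart edit and 2,000 working hours per year are purely illustrative assumptions that do not come from any measurement; they are there only to give a feel for the effort involved.

```python
def manual_workload(sites: int, distinct_apps: int, instances_per_site: int) -> dict:
    """Count the manual touch points implied by a pets approach."""
    total_instances = sites * instances_per_site
    return {
        "helm_charts_to_edit": total_instances,   # one Day 0 edit per instance
        "k8s_master_logins": sites,               # one K8s master per edge site
        "dashboards_to_watch": total_instances,   # one dashboard per instance
        "ansible_runs_per_script": total_instances // distinct_apps,
        "ansible_runs_total": total_instances,
    }


def person_years(edits: int, minutes_per_edit: float = 5.0,
                 hours_per_year: float = 2_000.0) -> float:
    """Rough effort estimate; both defaults are assumptions, not data."""
    return edits * minutes_per_edit / 60.0 / hours_per_year


workload = manual_workload(sites=1_000, distinct_apps=200, instances_per_site=1_000)
for name, count in workload.items():
    print(f"{name}: {count:,}")

effort = person_years(workload["helm_charts_to_edit"])
print(f"Helm Chart edits alone: about {effort:.1f} person-years")
```

Under those assumed figures, the initial Helm Chart edits alone come to roughly 40 person-years, before a single CLI command is run.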

Keep in mind, actual edge environments could scale even more. There could be 100,000 edge sites and 1,000 applications, mushrooming to 10,000 application instances per edge site with 5G network slicing. If you are thinking this scale is a pipe dream, I'd remind you of the comment often attributed to Thomas Watson in 1943: "I think there is a world market for maybe five computers."
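As a quick check on what that larger scenario implies, the following Python lines multiply out the figures quoted in this post. The variable names are hypothetical.

```python
# Baseline thought experiment: 1,000 sites x 1,000 instances per site.
baseline_instances = 1_000 * 1_000

# Larger scenario: 100,000 sites x 10,000 instances per site.
larger_instances = 100_000 * 10_000

print(f"{larger_instances:,}")                 # 1,000,000,000
print(larger_instances // baseline_instances)  # 1000
```

That is a billion application instances, a thousand times the already unmanageable baseline.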

So what's the solution to this seemingly impossible problem? Join us for the "What's New in AMCOP 2.0" meetup on Monday, Feb. 15, 2021 at 7 AM Pacific Time for the answer, or see the next installment of this blog series next week.