CI/CD & Tenant Provisioning System


Project Goals

The goal of this project was to increase the release velocity of a legacy application that had recently been refactored to use per-tenant containers rather than a multi-tenant monolith backend. The issues were:

  • Releases were manual and took hours to build.
  • Code deployment was not automated and was prone to failure.
  • Deployments lacked proper integration with AWS services (RDS, S3, SSM Parameter Store).
  • Tenant deployments were manual and complex.
  • Certificates caused significant issues because provisioning went through a third-party vendor and required manual work.

To properly address these needs, I did the following:

  • Analyzed the dev team's current skill set and operational maturity.
  • Talked to the infrastructure team to learn the rules of engagement for the operational system.
  • Analyzed the build stages to find ways to eliminate delays.
  • Fully mapped out the tenant provisioning requirements

Project Work

Once the analysis was complete, I embedded with the team and fixed the issues:

  1. Fix the build.
    1. We began by fixing the build system: selective git checkouts, an optimized dependency system, and working through the build process to parallelize it where we could and generally cut the time down.
    2. Removed any configuration from the build artifacts, as this would need to be added per tenant down the road.
    3. Set the build to only generate artifacts and push them to AWS ECR.
  2. Design the per-tenant infrastructure.
    1. Understand the isolation requirements.
    2. Create minimal infra required to service a tenant, in this case an Aurora cluster that could be expanded, S3 space, a pod group in Kubernetes, some SSM Parameter Store configuration, and all appropriate IAM roles.
    3. Set up the tenant’s AWS infrastructure as CloudFormation templates.
    4. Set up their Kubernetes infrastructure as a Helm chart.
    5. Document the steps to deploy.
  3. Automate tenant provisioning.
    1. Given the programming team’s skill set, the deployment code was written in TypeScript so it could be maintained after my departure from the project.
    2. Created a state machine system to handle infra deployment and provisioning steps. This used a DynamoDB table to store state info for all deployments in a region, segregated by deployment environment.
    3. Created logic to handle the normal state transitions.
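The state machine in step 3 can be sketched roughly as follows. All state names and fields here are illustrative, not the client's actual schema; the production system persisted each record as a DynamoDB item keyed by deployment environment and used conditional writes to guard transitions.

```typescript
// Illustrative provisioning state machine (hypothetical names, not the
// production schema).
type ProvisioningState =
  | "PENDING"
  | "INFRA_DEPLOYING"
  | "INFRA_READY"
  | "APP_DEPLOYING"
  | "ACTIVE"
  | "FAILED";

// Legal transitions; anything else is rejected.
const transitions: Record<ProvisioningState, ProvisioningState[]> = {
  PENDING: ["INFRA_DEPLOYING"],
  INFRA_DEPLOYING: ["INFRA_READY", "FAILED"],
  INFRA_READY: ["APP_DEPLOYING"],
  APP_DEPLOYING: ["ACTIVE", "FAILED"],
  ACTIVE: [],
  FAILED: ["PENDING"], // allow a retry after a failed deploy
};

interface TenantRecord {
  tenantId: string;
  environment: "dev" | "staging" | "prod";
  state: ProvisioningState;
}

function advance(record: TenantRecord, next: ProvisioningState): TenantRecord {
  if (!transitions[record.state].includes(next)) {
    throw new Error(`Illegal transition ${record.state} -> ${next}`);
  }
  // In production this would be a conditional DynamoDB update so that
  // concurrent workers cannot double-apply the same transition.
  return { ...record, state: next };
}
```

Keeping the legal transitions in a single table makes it easy to audit which paths a deployment can take and to reject out-of-order updates.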

Challenges Encountered

During this project, the following issues were encountered and had to be worked through:

  • Because this is a large company, we constantly ran up against internal compliance requirements that added complexity to the deployment. Some examples:
    • All deployments had to properly support various on-container security software. These were usually added during the container build and factored into the budget for deployment time and reliability.
    • The client would not issue a wildcard certificate for production use, so I wrote a dynamic certificate provisioning system that minimized the number of certificates needed and automatically applied them to AWS ALBs. This software would request, validate, issue, and apply the certificates to an ALB to satisfy both the client’s and AWS’s requirements.
    • Instrumentation was gated behind another team that ran Datadog. We had to sweet-talk them into giving us access so we could debug more efficiently.
    • The base Kubernetes infrastructure was run by another team, so some modifications were simply not allowed, or required working around limitations they had placed on the system. These included things like dynamic scaling limits that prevented rapid scaling during mass updates, or how permissions were handled.
  • The project initially covered only the CI/CD portion, but deployments had no way to do a per-tenant deployment. The client realized this when the contract’s initial goals were reached, and the contract was extended to further improve the deployments.
  • Client requirements mandated a full backup before deployments. We had designed the system to handle incremental deploys and rollbacks, but the full-backup requirement and the implicit point-in-time restore requirement made a short production outage necessary during upgrades (2-3 minutes per tenant, staggered as deployments rolled out).
  • The client had initially intended to self-manage the project with existing staff, but pivoted to having an embedded ops person handle deployment and provisioning issues. This unfortunately meant that TypeScript was not the best choice of language.
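The certificate-minimizing idea behind the dynamic provisioning system can be sketched as below. This is a simplified illustration under an assumed constraint (a fixed SAN-per-certificate limit, as ACM imposes); the real system also handled request, validation, issuance, and ALB attachment.

```typescript
// Hypothetical sketch: pack per-tenant hostnames onto shared SAN
// certificates so the certificate count stays low. The maxSans limit
// is an assumption standing in for the provider's per-cert quota.
function packHostnames(hostnames: string[], maxSans = 100): string[][] {
  const certs: string[][] = [];
  // Sort so related hostnames tend to land on the same certificate.
  for (const host of [...hostnames].sort()) {
    const open = certs[certs.length - 1];
    if (open && open.length < maxSans) {
      open.push(host); // room left on the current certificate
    } else {
      certs.push([host]); // start a new certificate
    }
  }
  return certs;
}
```

Each inner array represents one certificate request; without packing, every hostname would need its own certificate.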

Project Outcomes

Upon completion, the client benefited from:

  • Increased deployment frequency from weekly to 5+ times per day in all pre-production environments
  • Rapid provisioning of new tenants within minutes (excluding CAB process)
  • Fully automated certificate management
  • Centralized system functions via the Azure DevOps console

Key Takeaways

  1. Custom provisioning software, while effective initially, may not be suitable for long-term maintenance by non-developers. Alternative solutions, such as AWS Step Functions or bash scripts, may be more effective.
  2. It is essential to prioritize implementation over debating security concerns without a clear threat model.
  3. Embedding with the team is crucial for achieving optimal project results.

Talk To Me

Contact Details

Need quick advice or direction on a cloud architecture problem? Send a message and we’ll figure out a game plan. Please add as much detail as possible, and a reliable way to contact you. Thanks!

Boston Area, Massachusetts, US
@DansHardware