October 2023

AWS Blu Insights - Operational Excellence

Building high-quality software requires rigor and a firm commitment to excellence. Our deep belief in the importance of quality assurance (QA) drives us to implement robust practices to ensure our products meet the highest standards before reaching customers. To maintain the highest bar of quality, we employ diverse mechanisms: a comprehensive and robust workflow, a multitude of tests (unit and end-to-end), meticulous reviews (code, security, and performance), and bug COE (Correction Of Error).

Let's take a closer look at the workflow we adhere to. It is divided into 13 steps covering the following areas: 🔎
 

  • Setup: the feature is defined and specified, potential impacts on performance, security, or existing components are identified, and architecture choices are described. The most important part of this step is making sure we are building the right thing in the right way. A scalable and stable product that does not deliver the expected outcomes is useless.
  • Validations (Setup, Feature, Code and Security): using a mix of manual and automated mechanisms, we ensure that the development is consistent with the setup and introduces no bugs, performance degradations, or security risks. The task owner seeks advice and guidance from other team members, which fosters knowledge sharing and team building, while retaining ownership of the work done and its impacts. During this validation process, modifications may be requested, and thoroughness is highly encouraged since this phase is crucial to ensure no regression or bug is introduced into the product. During code validation, security guardians are involved to ensure that the code complies with AWS security standards.
  • Environment testing: both environments, Canary for preproduction validation and Flamingo for production, are continuously tested manually. Canary is also automatically scanned and tested daily. We strive to stay one step ahead of our customers by catching bugs before they reach the production environment. More than 3,000 tests run with Cypress, Playwright and JUnit, covering the application, classification and dependencies, with daily reports shared with the team.
  • Documentation: thorough documentation accompanies all new features. It serves a dual purpose: empowering our customers with the information they need for efficient and transparent usage, and communicating the benefits of new features and their adoption.

Figure: Workflow from setup to Canary
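
To make the "daily team reports" mentioned in the environment testing step more concrete, here is a minimal sketch of how results from several test runners could be rolled up into one summary. It is an illustration only, not the reporting tool the team actually uses: the `CheckResult` type, the `daily_report` function, and the field names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """One test outcome (hypothetical shape, not a real runner format)."""
    runner: str   # e.g. "Cypress", "Playwright", "JUnit"
    area: str     # e.g. "application", "classification", "dependencies"
    name: str
    passed: bool


def daily_report(results):
    """Return ({area: (passed, failed)}, [failing test identifiers])."""
    passed = Counter()
    failed = Counter()
    failures = []
    for r in results:
        if r.passed:
            passed[r.area] += 1
        else:
            failed[r.area] += 1
            failures.append(f"{r.runner}/{r.area}/{r.name}")
    areas = sorted(set(passed) | set(failed))
    summary = {area: (passed[area], failed[area]) for area in areas}
    return summary, failures


def canary_is_clean(results):
    """Assumed gate: Canary counts as clean only when no test failed."""
    _, failures = daily_report(results)
    return not failures
```

Grouping by area rather than by runner keeps the report aligned with what the team cares about (application, classification, dependencies), whichever tool produced the result.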
 

At first sight, this workflow may appear complicated and time-consuming, but the investment pays off daily. Thanks to this commitment to quality, we are continuously improving our software while encountering fewer than one issue per week across 750+ active accounts.

This workflow also allows us to anticipate upcoming problems and challenges. For instance, we implemented a dependencies improvement (see Big graphs just got bigger) before we had to deal with multiple customer tickets about it. This approach also ensures that we deliver reliable Classification and Dependencies analysis, while continually expanding language and statement support.

Fixing issues is a major part of our workflow. Issues arise from various sources (e.g. users and ticketing systems). To ensure issues are fixed and to prevent recurrence, we meticulously describe each issue, identify the scenarios and impacts, and schedule meetings with the involved engineers to discuss the COE. The main questions are: What happened? Why? How do we prevent it from happening again? Our aim is to identify the root cause, create generic solutions, and permanently reduce the number of similar issues.

Identifying issues is a key point in our quest for quality, especially when our aim is to identify and address them before our customers do. To achieve this, we orchestrate monthly BugBash sessions, where the entire team collaborates to “break” the application. We've found this team-building exercise to foster team cohesion while purposefully challenging our product's integrity. All major findings are prioritized and addressed in the following days, if not hours. 

Operational excellence is not only about issues. It is also about SLAs (Service Level Agreements), response times, and service availability. By leveraging native AWS services such as ECS, Fargate, EFS and more, the AWS Blu Insights architecture ensures the expected quality. We also continually introduce new mechanisms to reduce the cost of the service infrastructure (see Scaling out/in policies and task protection in practice) without compromising the quality of the service for our customers.
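
The idea behind scaling in to save cost without hurting users can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the actual AWS Blu Insights configuration and not the ECS API: the `Task` type, the 60% CPU target, and the task limits are hypothetical, and a real scaling policy also involves alarms and cooldown periods that are left out here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A running task (hypothetical model, not an ECS API object)."""
    task_id: str
    protected: bool  # True while the task does work that must not be interrupted


def desired_count(running, avg_cpu, target_cpu=60.0, min_tasks=1, max_tasks=10):
    """Target-tracking style estimate: scale the count by observed/target CPU."""
    if running == 0:
        return min_tasks
    wanted = math.ceil(running * avg_cpu / target_cpu)
    return max(min_tasks, min(max_tasks, wanted))


def tasks_to_stop(tasks, wanted):
    """Pick surplus tasks to stop on scale-in, never touching protected ones."""
    surplus = max(0, len(tasks) - wanted)
    candidates = [t for t in tasks if not t.protected]
    return [t.task_id for t in candidates[:surplus]]
```

With four tasks at 30% average CPU, `desired_count(4, 30.0)` returns 2, so two tasks are surplus. If two of the four are protected, only the unprotected ones are stopped, which is how the cost drops without interrupting work in progress.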

Building innovative services for Mainframe Modernization is challenging, with strict requirements from all stakeholders. We are at the beginning of the journey. While we acknowledge the long roadmap ahead, we remain firm in our commitment to provide a service of the utmost quality. A huge thank you to all my colleagues from the service team for their rigor and commitment, and to our active users for their feedback and use cases.
 

Thanks for reading!