On Thursday, December 17th, from 01:00 to 02:40 (UTC), the Inspectorio Platform experienced a partial service disruption affecting the availability of the authentication and authorisation services. During that period, a limited number of users had trouble logging in to the Rise & Sight products (Web and Mobile applications). We met our Service Level Agreement (SLA) requirements for both incident resolution and uptime.
Functionalities Impacted
During that period the following functionalities were impacted:
All other functionality of the Inspectorio platform, including the live chat support channel, was unaffected by this incident. There was no data loss or corruption.
Incident Monitoring
During the incident, our logging system registered a spike to 100% CPU utilization on our Redis Memory Store. The spike began at 01:00 UTC, lasted until 02:40 UTC, and caused Service Unavailability errors.
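As an illustration of how a sustained spike like this can be picked out of raw metrics, here is a minimal sketch. It is not our monitoring system's actual logic: the function name `saturation_windows`, the 95% threshold, the 5-minute minimum duration, and the sample format (minute of the day in UTC, CPU percent) are all assumptions made for the example.

```python
def saturation_windows(samples, threshold=95.0, min_duration=5):
    """Find periods of sustained CPU saturation.

    samples: iterable of (minute_of_day, cpu_percent), sorted by time (assumed format).
    Returns a list of (start_minute, end_minute) pairs where CPU stayed at or
    above `threshold` for at least `min_duration` minutes.
    """
    windows = []
    start = None  # first saturated sample of the current run
    last = None   # most recent saturated sample of the current run
    for minute, cpu in samples:
        if cpu >= threshold:
            if start is None:
                start = minute
            last = minute
        else:
            # The run ended; keep it only if it lasted long enough.
            if start is not None and last - start >= min_duration:
                windows.append((start, last))
            start = None
            last = None
    # Close a run that is still open at the end of the data.
    if start is not None and last - start >= min_duration:
        windows.append((start, last))
    return windows
```

Requiring a minimum duration keeps short, harmless bursts from being reported as incidents, while a window as long as the one described above would still be flagged.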
Incident Cause Overview
The incident was caused by an issue in the latest optimization released for the Rise product, which was intended to prevent duplicate update requests for an organization integration request. As soon as we began receiving a batch update to Rise via the Direct API, this drove our Redis Memory Store to 100% CPU utilization, which degraded Redis availability for the Platform's authorization and authentication services.
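This report does not describe how the duplicate-prevention check was implemented. For readers unfamiliar with the pattern, the sketch below shows one common way to guard against duplicate update requests using a single Redis-style `SET` with the `NX` and `EX` options per request. Everything in it is hypothetical: `InMemoryStore` is a stand-in for a Redis client that mimics only that one call, and `request_fingerprint`, `should_process`, the key format, and the 60-second TTL are illustrative names and values, not the Rise implementation.

```python
import hashlib
import json
import time


class InMemoryStore:
    """Hypothetical stand-in for a Redis client; mimics SET with NX and EX only."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # name -> (value, expires_at)
        self._clock = clock

    def set(self, name, value, nx=False, ex=None):
        now = self._clock()
        entry = self._data.get(name)
        live = entry is not None and entry[1] > now
        if nx and live:
            return None  # like redis-py, return None when NX blocks the write
        expires_at = now + ex if ex is not None else float("inf")
        self._data[name] = (value, expires_at)
        return True


def request_fingerprint(org_id, payload):
    """Build a stable key from the organization and the request body."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return f"dedup:{org_id}:{digest}"


def should_process(store, org_id, payload, ttl_seconds=60):
    """Return True the first time a request is seen within the TTL window."""
    key = request_fingerprint(org_id, payload)
    # One atomic write per request: it succeeds only if the key is absent.
    return bool(store.set(key, "1", nx=True, ex=ttl_seconds))
```

In this sketch each request costs one constant-time command, so the load on the store grows in proportion to the batch size, and the expiry keeps old fingerprints from accumulating.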
Incident Recovery
When our team detected the outage of the Authorization & Authentication components through our internal monitoring system, we pinpointed the Redis service component as the origin of the issue. We first attempted to scale up Redis vertically. This action did not reduce CPU usage.
As per our protocol, we then rolled back the latest release of the Rise functionality to the previous version. This resolved the issue, and Redis CPU and other metrics returned to normal. The QA team performed a full check of the Sight & Rise functionality and confirmed full recovery of the Platform.
Improvement Action Plan
We aim to implement the following actions to prevent such incidents and to minimize and mitigate the resulting risks: