Rise Platform Service Degradation
Incident Report for Inspectorio INC
Postmortem

On Thursday, December 17th, from 01:00 to 02:40 (UTC), the Inspectorio Platform experienced a partial service disruption affecting the availability of the authentication and authorization services. During that period, a limited number of users had trouble logging in to the Rise & Sight products (Web and Mobile applications). We met our Service Level Agreement requirements for incident resolution and the Uptime SLA.

Functionalities Impacted

During that period the following functionalities were impacted: 

  • Login functionality to SIGHT
  • Login functionality to RISE

No other functionality of the Inspectorio platform, including the live chat support channel, was impacted by this incident. There was no data loss or corruption.

Incident Monitoring

During the incident period, our logging system registered a spike to 100% CPU utilization on our Redis Memory Store, which lasted from 01:00 UTC until 02:40 UTC and caused Service Unavailability errors.
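
As an illustration of how a sustained spike like this can be told apart from a momentary one, the following is a minimal sketch only. It assumes CPU samples arrive as (timestamp, percent) pairs in time order; the function name, threshold, and sample values are hypothetical and are not taken from our monitoring system.

```python
from typing import List, Tuple


def sustained_saturation(
    samples: List[Tuple[int, float]],
    threshold: float = 95.0,
    min_samples: int = 3,
) -> List[Tuple[int, int]]:
    """Return (first_ts, last_ts) for each run of at least `min_samples`
    consecutive samples whose CPU value is at or above `threshold`."""
    runs: List[Tuple[int, int]] = []
    run_start = None  # timestamp of the first sample in the current run
    run_last = None   # timestamp of the latest sample in the current run
    run_len = 0

    for ts, cpu in samples:
        if cpu >= threshold:
            if run_start is None:
                run_start = ts
            run_last = ts
            run_len += 1
        else:
            # The run (if any) ended; keep it only if it was long enough.
            if run_start is not None and run_len >= min_samples:
                runs.append((run_start, run_last))
            run_start, run_last, run_len = None, None, 0

    # Close a run that is still open at the end of the data.
    if run_start is not None and run_len >= min_samples:
        runs.append((run_start, run_last))
    return runs


# Made-up samples: minute offsets paired with CPU percent.
example = [(0, 20.0), (1, 100.0), (2, 35.0), (3, 100.0), (4, 100.0), (5, 100.0), (6, 40.0)]
print(sustained_saturation(example))
```

Requiring several consecutive samples above the threshold keeps a single noisy reading (minute 1 in the made-up data) from being reported as saturation.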

Incident Cause Overview

The incident was caused by an issue in the latest optimization released for the Rise product (intended to prevent duplicate update requests for an organization integration request). As soon as we started to receive a batch update to Rise via the Direct API, our Redis Memory Store reached 100% CPU utilization, which degraded Redis availability for the Platform's authorization and authentication services.
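
This report does not describe how the duplicate-prevention optimization was implemented. Purely as an illustration of the general technique, the sketch below suppresses repeated update requests for the same key within a time window using one constant-time lookup per request. It is kept in memory so it runs on its own; the class name, function name, key format, and window length are all hypothetical.

```python
import time
from typing import Dict, Iterable, List, Optional


class DedupWindow:
    """Suppress repeated requests for the same key within `ttl_seconds`."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._expires_at: Dict[str, float] = {}

    def should_process(self, key: str, now: Optional[float] = None) -> bool:
        """Return True the first time `key` is seen in the current window."""
        now = time.monotonic() if now is None else now
        expires_at = self._expires_at.get(key)
        if expires_at is not None and expires_at > now:
            return False  # duplicate inside the window: skip it
        # First sighting, or the previous window has expired: start a new one.
        self._expires_at[key] = now + self.ttl_seconds
        return True


def filter_batch(
    window: DedupWindow, keys: Iterable[str], now: Optional[float] = None
) -> List[str]:
    """Keep only the keys in a batch that are not duplicates."""
    return [key for key in keys if window.should_process(key, now)]


window = DedupWindow(ttl_seconds=60)
print(filter_batch(window, ["org-1", "org-2", "org-1"], now=0.0))
```

The point of the sketch is that the cost of the check stays proportional to the number of requests in the batch, so a large batch does not multiply the work done per request.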

Incident Recovery

When our team identified the outage of the Authorization & Authentication components through our internal monitoring system, we pinpointed the Redis service component as the origin of the issue. We first attempted to scale up Redis by triggering vertical scaling of this service, but this action did not reduce CPU usage.

As per our protocol, we then rolled back the latest release of the Rise functionality to the previous version. This resolved the issue, and Redis CPU and other parameters returned to normal. The QA team performed a full check of the Sight & Rise functionality and confirmed full recovery of the Platform.

Improvement Action Plan

We aim to implement the following actions to prevent such incidents and to minimize and mitigate their possible consequences:

  1. Set up a topology of isolated Redis instances to improve the stability and high availability of the system and to make it more resilient to outages
  2. Deprecate Direct API usage, as it has significantly less advanced performance control functionality and might impact our internal services directly. We will ask our clients to plan their migration to the Integration API solution (migration is complete for Sight, but the Direct API is still used for Rise)
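
The first action above can be pictured with a minimal sketch in which each workload is mapped to its own Redis instance, so that load from one workload cannot saturate the instance used by authentication. The workload names and hostnames are hypothetical placeholders, not our actual configuration.

```python
from typing import Dict, List

# Hypothetical workload names and hostnames, for illustration only.
REDIS_TOPOLOGY: Dict[str, str] = {
    "auth": "redis-auth.example.internal:6379",
    "rise-integration": "redis-rise.example.internal:6379",
    "sight-integration": "redis-sight.example.internal:6379",
}


def resolve_instance(workload: str) -> str:
    """Return the dedicated Redis address for a workload.

    Unknown workloads raise an error instead of silently falling back
    to an instance that another workload depends on.
    """
    try:
        return REDIS_TOPOLOGY[workload]
    except KeyError:
        raise ValueError(
            f"no Redis instance configured for workload {workload!r}"
        ) from None


def shared_instances(topology: Dict[str, str]) -> Dict[str, List[str]]:
    """Return addresses used by more than one workload (ideally none)."""
    owners: Dict[str, List[str]] = {}
    for workload, address in topology.items():
        owners.setdefault(address, []).append(workload)
    return {addr: names for addr, names in owners.items() if len(names) > 1}


print(resolve_instance("auth"))
print(shared_instances(REDIS_TOPOLOGY))
```

A check like `shared_instances` could run at deploy time to flag any configuration in which two workloads end up pointing at the same instance.
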
Posted Dec 17, 2020 - 14:13 UTC

Resolved
This incident has been resolved.
Posted Dec 17, 2020 - 02:40 UTC
Identified
The issue has been identified. The team is working on a fix.
Posted Dec 17, 2020 - 01:00 UTC
This incident affected: Rise (Integration API, Mobile API, Web).