title | description | layout | subcategory | category |
---|---|---|---|---|
Platform On-Call Guide | Runbook/guide for rotations/responsibilities for the Login.gov Platform engineering teams. | article | How To | Platform |
To help balance the different workloads across the Login.gov Platform teams, we maintain several on-call and help roles, each on a weekly rotation schedule. These rotations let us give our customers (primarily the AppDev engineers) timely and comprehensive assistance, and they strengthen our teams' knowledge of, and comfort with, the various tasks and responsibilities involved in the Platform teams' work.
Rotation / Paging Schedule Name | Slack Handle | Slack Main Channel(s) | Coverage | Notes |
---|---|---|---|---|
Platform OnCall - Primary | @login-platform-oncall | #login-devops / #login-events | 24/7 | Top responder for Platform issues |
Platform OnCall - Secondary | @login-platform-oncall | #login-devops / #login-events | 24/7 | 5 minute delay backup for primary |
Interrupts | @login-platform-help | #login-platform-help | Business Hours | Developer support and toil |
Deployment | @login-platform-deployer | #login-devops | Business Hours | Release manager for identity-devops code |
DevTools | @login-devtools-oncall | #login-devops | Business Hours | GitLab and automation specific support |
All schedules rotate at 1300 (1PM) Eastern Time every Tuesday, and are signaled by an automated message in the #login-devops Slack channel, e.g.:
![Screenshot of weekly Slack message for Platform rotation handoffs]({{ site.baseurl }}/images/oncall-rotation-slack-weekly.png)
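
For anyone scripting reminders around the weekly handoff, the Tuesday 1300 Eastern boundary described above can be computed as in the sketch below. The function name `next_handoff` and the constants are illustrative and not part of any Login.gov tooling. The sketch assumes the caller passes a timezone-aware `datetime` and that the system has IANA time zone data available for `zoneinfo`.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: "Eastern Time" in this guide maps to the America/New_York zone.
EASTERN = ZoneInfo("America/New_York")
HANDOFF_WEEKDAY = 1  # datetime.weekday(): Monday is 0, so Tuesday is 1
HANDOFF_HOUR = 13    # 1300 (1PM) Eastern


def next_handoff(now: datetime) -> datetime:
    """Return the first Tuesday 1300 Eastern strictly after `now`.

    `now` must be timezone-aware; it is converted to Eastern time first.
    """
    local = now.astimezone(EASTERN)
    days_ahead = (HANDOFF_WEEKDAY - local.weekday()) % 7
    candidate = (local + timedelta(days=days_ahead)).replace(
        hour=HANDOFF_HOUR, minute=0, second=0, microsecond=0
    )
    # If it is already Tuesday at or past 1300, roll over to next week.
    if candidate <= local:
        candidate += timedelta(days=7)
    return candidate


# Example: the upcoming handoff relative to the current time.
upcoming = next_handoff(datetime.now(timezone.utc))
```

Because the calculation is done in wall-clock Eastern time, the result stays at 1300 local time across daylight saving changes.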
Mission: Take care of production!
- Oncall Guide Quick Reference - emergency contact list and other private information
- Incident Response Checklist - when an incident arises
- Troubleshooting Quick Reference - when you are troubleshooting and not sure where to start
- Platform Rotations in AWS Incident Manager - to check who is on call
- Acknowledge pages - ACK pages within 5 minutes (if possible) to ensure a timely response and to avoid rollover to the Secondary On-Call
- Appropriately respond to alerts - Assess an alert's impact on end users and service providers and judge its severity, acting as Incident Response reporter/Situation Lead if appropriate
- Check production (`prod`) environment - Review systems and logs for indicators of issues which are not yet monitored, or unexpected behaviors
- Alert `@login-appdev-oncall` if production may be impacted - Make sure they are aware anytime things are going poorly in production
- Initiate Incident Response (IR) process - Act as Situation Lead/Incident Commander following the [Security Incident Response Guide]({% link _articles/incident-response-guide.md %})
- Monitor Channels - Keep an eye on `#login-events` for problems requiring response or investigation
- Review any open PRs that have been sitting over 48 hours in `identity-devops`, `identity-terraform`, `identity-base-image`, or `identity-cookbooks`
- Ensure clean handoff of ongoing issues - Review and update as is appropriate in the LG Platform - Interrupts board
- Discuss prior week's issues in Tuesday 1300ET handoff thread in `#login-devops`
- Maintain the `@login-devops-oncall` group - Update the handle at the time of the weekly Handoff Boundary
- Take care of your well being - You are but one human, and the team is here for you! Your health and relationships must take priority over on-call responsibilities. If being on-call is causing harm, let the team know immediately.
Do these as you enter the Primary On-Call rotation:
- Update the `@login-devops-oncall` Slack group handle - In `#login-devops`, click on `@login-devops-oncall` in the channel topic, and then edit the list of users to match the new Primary and Secondary On-Call engineers, as per the schedule in AWS Incident Manager
- Discuss recent issues with previous Primary On-Call engineer, if any
- Review the `prod-idp-workload` CloudWatch dashboard
  - Look for errors, latency spikes, or any other unusual activity
  - Improve your sense of what "unusual" and "usual" events look like by zooming out
- Open PRs or track issues in `identity-devops` to adjust problematic alerts or fill critical observability gaps
  - Alert fatigue is real, so let's fight it!
  - Not being able to understand what is happening in the system is stressful, so let's improve observability!
As you exit your Primary On-Call period:
- Discuss recent issues with the incoming Primary On-Call engineer
- Reflect on this On-Call period:
  - Assess the stress level you experienced
  - Suggest improvements to on call process, docs, etc
- Share your experience(s) in the weekly Platform Rotation Handoff Boundary message thread in `#login-devops`
Mission: Support the Primary On-Call engineer!
- Acknowledge and work on escalated pages - ACK pages that Primary On-Call is unable to reach in initial 5-minute period
- Override On-Call schedule to act as Primary On-Call if scheduled Primary is unavailable
- Assist with active incidents - Provide additional technical support or offer to take Situation Lead duties
- Help out with excess toil - Assist the Interrupts engineer if necessary
- Offer material and psychological support to Primary - Empathize! Proactively reach out if they have experienced high stress situations or worked over 8 hours without any breaks
- If any incident has occurred in the last 24 hours, check in with Primary On-Call engineer:
  - How are they feeling?
  - Do they need to pass off Primary for a bit?
Mission: Support the Login.gov Platform's customers!
In addition to the LG Platform: Interrupts board on GitHub, the following `identity-devops` wiki pages are helpful for most Interrupts responsibilities:
- Setting Up your Login.gov Infrastructure Configuration
- Setting Up AWS Vault
- Building a Personal Sandbox Environment
- Common Infrastructure Commands and Shortcuts
- IAM Configurations - for on/offboarding AWS IAM users
- Making Changes via Terraform - for troubleshooting Terraform deployment issues
- Watch the `#login-platform-help` Slack channel - Assist users with Platform questions, automation, tools, and application sandbox troubleshooting
- Manage the LG Platform: Interrupts board
- Provision new users and remove offboarded users - Self-assign open Onboarding and Offboarding issues in `identity-devops`
- Lead AWS onboarding sessions with new users - Attend and lead the bi-weekly AWS Onboarding Time meeting Mondays at 1630 (4:30PM) Eastern Time
- Refine automation/tools - Make things easier, safer, and less dependent on context
- Do NOT do project work! - Go mining in our docs for things to fix if you are bored!
Do these as you enter Interrupts:
- Update the `@login-platform-help` Slack group handle
- Check in on the LG Platform: Interrupts board
- Check with outgoing Interrupts engineer - Review any notable handoff items
- Make sure any un-provisioned new users are invited to a future AWS Onboarding Time session - This should be done during your rotation!
- Check if anyone needs help in `#login-platform-help`
- Immediately disable anyone who has left the program but is still provisioned - Additionally, remove `prod` access for anyone who will be leaving the program within the week
- Work the LG Platform: Interrupts board - Update issue Status and add notes as is appropriate
- Host at least one AWS Onboarding Time session if anyone needs to onboard with AWS Access - Issues on the Interrupts board / in `identity-devops` should help you identify new and not yet initialized users
- Make sure the LG Platform: Interrupts board is up to date
- Communicate in-flight work with incoming Interrupts engineer - Review any notable handoff items
- Reflect on your Interrupts rotation experience
  - Identify major sources of toil
  - Think about investments that could reduce/eliminate toil
Mission: Ship!
- Prepare weekly `identity-devops` release and deploy it following the Weekly Platform Deployments guide
See the Responsibilities above for a link to the full release and deployment process including daily tasks.
- Update the `@login-platform-deployer` Slack group handle
- Communicate any deploy issues with incoming Deployment rotation engineer
- Note any `stages/` branches which required force-pushing (i.e. could not be fast-forwarded) to the newest release tag
- Note any environments and/or directory/account combinations that should not be deployed to in the next release, and why
- Note any
Mission: Support GitLab and related automation tools and infrastructure!
Note - This is not currently a rotation. We will reassess our approach to GitLab and automation support in the coming months.
- Respond to problems with GitLab CI/CD
To temporarily take over the Primary or Secondary On-Call schedule:
- Open Platform Team Overrides
- Click "platform_primary"
- Click "Schedule Calendar"
- Click "Create shift override"
- In "Select rotation" select "Platform Primary"
- Select the start and end time of the override
- Select the person who will be taking over the scheduled time.
- Click "Create shift override" to set the override
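
The console steps above could in principle be scripted. The sketch below only builds the request parameters for the AWS Systems Manager Incident Manager Contacts `CreateRotationOverride` API; it does not call AWS. It assumes the Platform schedules are Incident Manager rotations, and the helper name `build_override_request` and the ARNs are hypothetical placeholders, not real Login.gov resources.

```python
from datetime import datetime, timedelta, timezone


def build_override_request(rotation_arn, contact_arn, start, hours):
    """Build keyword arguments for a rotation override request.

    rotation_arn: ARN of the rotation to override (placeholder in this sketch)
    contact_arn:  ARN of the contact taking over the shift (placeholder)
    start:        timezone-aware datetime when the override begins
    hours:        length of the override in hours
    """
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    if hours <= 0:
        raise ValueError("hours must be positive")
    return {
        "RotationId": rotation_arn,
        "NewContactIds": [contact_arn],
        "StartTime": start,
        "EndTime": start + timedelta(hours=hours),
    }


# Hypothetical ARNs for illustration only.
example = build_override_request(
    "arn:aws:ssm-contacts:us-east-1:111122223333:rotation/EXAMPLE-ROTATION",
    "arn:aws:ssm-contacts:us-east-1:111122223333:contact/example-engineer",
    datetime(2024, 3, 5, 18, 0, tzinfo=timezone.utc),
    hours=4,
)

# With boto3 installed and suitable credentials, the dictionary could be
# passed along as:
#   boto3.client("ssm-contacts").create_rotation_override(**example)
```

Keeping the parameter construction separate from the API call makes the time window easy to review before anything changes in the real schedule.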
Engineers on the Platform teams at Login.gov are expected to participate in at least one of the rotation types every 8 weeks starting after their first 60 days on the program. Suggested rotations:
- Interrupts - A great first rotation type for new team members, and a great way to contribute if you are not part of On-Call rotations.
- Deployment - Another good new team member rotation, particularly if you are not part of the On-Call rotations.
- DevTools - Ideal for members of Team Mary. Currently just a group, but this may become a rotation in the future.
- On-Call Primary/Secondary - After time in other rotations, and after preparing as described in Are You Ready To Be On-Call?, those who can are urged to join this rotation.
Before joining the Primary/Secondary On-Call rotation schedules for the Platform team, ensure the following are all true:
- Able to fully access our AWS accounts
- Comfortable with sandbox tasks (Terraform `plan` and `apply`, navigating instances)
- Comfortable navigating APM and Infrastructure areas in NewRelic
- Comfortable reviewing logs in AWS CloudWatch and/or with `tail-cw` SSM command
- Shadowed full set of deploys: `dev`, `int`, `staging`, `dm`, and `prod` application deployments, and other platform code (Deployment rotation)
- Reviewed [Security Incident Response Guide]({% link _articles/incident-response-guide.md %})
- Reviewed past postmortems
- Joined `#login-situation` channel
- Participated in at least one bi-weekly Contingency Plan Training Wargames session
- Participated in at least one "Klaxon" session (if sessions are running)
- Joined `identity-devops` Google Hangout group (in case of Slack outage)
- Able to SSM into `prod` EC2 instances
- [AWS Incident Manager]({% link _articles/platform-aws-incident-manager.md %}) configured
- Created and tested GSA email IdP account with SMS and PIV enabled in:
  - `int`
  - `staging`
  - `prod`
FEELING READY? You got this!