Updating operator guide w/ design decisions and runbook (#19)
* Updating operator guide w/ design decisions and runbook

* merge operator guide, update README

---------

Co-authored-by: chrisghill <[email protected]>
coryodaniel and chrisghill authored Aug 23, 2024
1 parent a120f09 commit 6768588
Showing 2 changed files with 87 additions and 10 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@ Amazon Aurora is a fully managed relational database engine that's compatible wi

## Design

For detailed information, check out our [Operator Guide](operator.mdx) for this bundle.
For detailed information, check out our [Operator Guide](operator.md) for this bundle.

## Usage

95 changes: 86 additions & 9 deletions operator.md
@@ -22,17 +22,94 @@ Aurora is part of the managed database service Amazon Relational Database Servic
* Applications that don't use the load-balanced reader endpoint can read from the writer endpoint
* The minimum backup retention period is 1 day, because backups [cannot be disabled in Aurora](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Backups.html)

## Caveats
## Runbook

### Connection Issues

If you cannot connect to the Aurora PostgreSQL cluster, work through the following checks.

Check the cluster's current status and endpoint information:

```sh
aws rds describe-db-clusters --query "DBClusters[?DBClusterIdentifier=='<cluster_identifier>'].[Status, Endpoint, ReaderEndpoint]" --output table
```

> Expect to see the status of the cluster along with the primary and reader endpoints.

Verify the security group rules to ensure proper ingress rules are set up:

```sh
aws ec2 describe-security-groups --group-ids <security_group_id> --query "SecurityGroups[*].[GroupId, IpPermissions]" --output table
```

> Confirm that the ingress rules allow traffic from your IP or subnet.
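
If the cluster is available and the ingress rules look correct, a TCP probe from the client host can show whether the port is reachable at all. The following is a minimal sketch and not part of the bundle: `check_port` is a hypothetical helper, and it assumes `bash` (for `/dev/tcp`), the coreutils `timeout` command, and the default PostgreSQL port 5432.

```sh
# check_port HOST [PORT]: print "reachable" if a TCP connection opens within 5 seconds.
# Hypothetical helper; assumes bash and coreutils `timeout` are installed.
check_port() {
  host="$1"
  port="${2:-5432}"  # assumed default PostgreSQL port
  # bash opens a TCP socket when a redirection targets /dev/tcp/HOST/PORT
  if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
    return 1
  fi
}

# Example (placeholder endpoint): check_port <cluster_endpoint> 5432
```

An `unreachable` result points at network configuration (security groups, routing, or probing from outside the VPC) rather than at database credentials.
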
### High Latency Queries

If queries are running slowly, use the following steps to identify the problematic ones:

Connect to your PostgreSQL instance and check for long-running queries:

```sql
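-- The boolean "waiting" column was replaced by wait_event_type/wait_event in PostgreSQL 9.6
SELECT pid, now() - query_start AS runtime, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start ASC; -- oldest (longest-running) statements first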
```

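> Look for queries that have been running for a long time and investigate their execution plans.

Enable slow query logging. `ALTER SYSTEM` requires a true superuser and is not available on Aurora, so set `log_min_duration_statement` in the cluster's custom DB cluster parameter group instead:

```sh
# Logs statements that take longer than 1000 ms; the parameter is dynamic, so it applies without a reboot
aws rds modify-db-cluster-parameter-group --db-cluster-parameter-group-name <cluster_parameter_group_name> --parameters "ParameterName=log_min_duration_statement,ParameterValue=1000,ApplyMethod=immediate"
```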

> This will log slow queries to help identify and optimize them.
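
With `log_min_duration_statement` set, each logged statement carries a `duration: <ms> ms` prefix. A minimal sketch for filtering those lines, assuming the log text has already been downloaded (for example with `aws rds download-db-log-file-portion`); `slow_statements` is a hypothetical helper name and the threshold is illustrative.

```sh
# slow_statements MIN_MS: read PostgreSQL log lines on stdin and print only those
# whose "duration: N ms" value is at least MIN_MS. Hypothetical helper.
slow_statements() {
  awk -v min="$1" '
    match($0, /duration: [0-9.]+ ms/) {
      # skip the 10-character "duration: " prefix and the 3-character " ms" suffix
      ms = substr($0, RSTART + 10, RLENGTH - 13) + 0
      if (ms >= min) print
    }'
}

# Example (requires AWS credentials; identifiers are placeholders):
# aws rds download-db-log-file-portion --db-instance-identifier <instance_identifier> \
#   --log-file-name <log_file_name> --output text | slow_statements 5000
```
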
### Backup Verification

Ensure your backups are being created and managed as expected.

List the available snapshots for your Aurora PostgreSQL cluster:

```sh
aws rds describe-db-cluster-snapshots --db-cluster-identifier <cluster_identifier> --query "DBClusterSnapshots[].[DBClusterSnapshotIdentifier, SnapshotCreateTime]" --output table
```

> Verify that snapshots are created according to your backup policy.

Check backup retention settings:

```sh
aws rds describe-db-clusters --db-cluster-identifier <cluster_identifier> --query "DBClusters[0].[BackupRetentionPeriod]" --output table
```

> Ensure that the retention period is set according to your organization's policy.
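
The retention check can also be scripted, for example in a scheduled job. This is a minimal sketch under stated assumptions: `check_retention` is a hypothetical helper, and the required number of days is whatever your organization's policy specifies (7 below is only an illustration).

```sh
# check_retention ACTUAL_DAYS REQUIRED_DAYS: compare the cluster's retention
# period against a policy minimum. Hypothetical helper; returns non-zero on a violation.
check_retention() {
  actual="$1"
  required="$2"
  if [ "$actual" -ge "$required" ]; then
    echo "OK: retention is ${actual} day(s)"
  else
    echo "WARN: retention is ${actual} day(s), policy requires ${required}"
    return 1
  fi
}

# Example (requires AWS credentials; 7 is an illustrative policy value):
# days=$(aws rds describe-db-clusters --db-cluster-identifier <cluster_identifier> \
#   --query "DBClusters[0].BackupRetentionPeriod" --output text)
# check_retention "$days" 7
```
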
### Disk Space Usage

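Monitor and manage the storage usage for your Aurora PostgreSQL cluster. Aurora cluster volumes grow automatically, so track how much storage is in use rather than how much is free.

Check the current storage usage metrics:

```sh
aws cloudwatch get-metric-statistics --namespace "AWS/RDS" --metric-name "VolumeBytesUsed" --dimensions Name=DBClusterIdentifier,Value=<cluster_identifier> --statistics Average --period 300 --start-time $(date -u -d '1 hour ago' +"%Y-%m-%dT%H:%M:%SZ") --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ")
```

> `VolumeBytesUsed` reports the size of the cluster volume in bytes; `FreeStorageSpace` is an RDS instance metric and is not reported for Aurora cluster volumes. The `date -d` syntax requires GNU `date`. Watch the trend over time to catch unexpected growth.

Reclaim disk space in PostgreSQL: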

```sql
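VACUUM; -- Marks dead rows as reusable; usually does not shrink the table files
VACUUM FULL; -- Rewrites each table under an ACCESS EXCLUSIVE lock; run it during maintenance windows
REINDEX DATABASE your_database_name; -- Rebuilds indexes and blocks writes to each table while it runs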
```

> Regular maintenance tasks like vacuum and reindex help to reclaim space and improve performance.
* IAM Authentication is *not* implemented, but on our roadmap. Please add a comment/thumbs up on this [issue](https://github.com/massdriver-cloud/aws-aurora-postgresql/issues/4) and we will prioritize.
* RDS Proxy is *not* implemented, but on our roadmap. Please add a comment/thumbs up on this [issue](https://github.com/massdriver-cloud/aws-aurora-postgresql/issues/3) and we will prioritize.
* Backup Plans are *not* implemented, but on our roadmap. Please add a comment/thumbs up on this [issue](https://github.com/massdriver-cloud/aws-aurora-postgresql/issues/5) and we will prioritize.
* [Custom endpoints](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.Endpoints.html#Aurora.Endpoints.Cluster) aren't currently on our roadmap. Please open an [issue](https://github.com/massdriver-cloud/aws-aurora-postgresql/issues) if you need support for this.
* Cluster role associations aren't currently on our roadmap. Please open an [issue](https://github.com/massdriver-cloud/aws-aurora-postgresql/issues) if you need support for this.
* Automatic minor version upgrades are disabled. Please open an [issue](https://github.com/massdriver-cloud/aws-aurora-postgresql/issues) if you need support for this.
* No support for Aurora Global. Please open an [issue](https://github.com/massdriver-cloud/aws-aurora-postgresql/issues) if you need support for this.
* No support for non-Aurora PostgreSQL. Please open an [issue](https://github.com/massdriver-cloud/aws-aurora-postgresql/issues) if you need support for this.


## Links