
DRAFT Read replication and moving the API to a second database #4427

Draft · wants to merge 96 commits into main
Conversation

@jadudm (Contributor) commented Nov 1, 2024

Still in progress. This message will be removed, and the PR moved out of draft, when it is ready. Saving incrementally.

This PR represents work from @asteel-gsa , @rnovak338 , @jadudm , and has benefited greatly from conversation with many members of the team.

This is a large PR.

DB1 = fac-db
DB2 = fac-snapshot-db

Changes

  • The API is removed from DB1
  • The old Census data used for cog/over assignments is removed, as we now use our internal/cleaned data for this purpose (and have for a long time)
  • Advanced search now points at DB2
  • A stateless/reproducible data pipeline is introduced in DB2 (regenerated nightly via a GH Action)
  • api_v1_0_3 is removed, as it was suppressing all Tribal data, and therefore arguably incorrect.
  • admin_api_v... is removed. No administrative access happens via the API. This also removes an entire access control mechanism. 🎉 (A good thing for maintenance and compliance.)
  • api_v1_1_0 is the new default API.
  • api_v2_0_0 is introduced
  • Separate tables on separate schemas are introduced as part of the pipeline: public_data_v1_0_0 and suppressed_data_v1_0_0. Only public data lives in the public schema, and only suppressed data (e.g., audits from Tribes/Tribal Organizations that have opted to suppress their data) lives in the suppressed schema.
  • api_v2_0_0 uses no access control on data in the public schema, and applies access control to the data in the suppressed schema.
  • A new mechanism for adding/removing privileged users in the Django Admin dashboard is introduced. It is now managed via a static JSON document in the config directory, and is handled via onboarding/offboarding just like the cloud.gov access.
  • A new mechanism for adding/removing access to suppressed reports for Federal users is introduced. This is now handled via the Django admin panel.
  • A new tool for loading bulk data is offered; it is entirely for local/dev use.
  • sling is introduced into the environment; it is a tool for moving data DB->DB, DB->S3, S3->DB, etc. A pilot/prototype for generating full dumps of the database to CSV (direct to S3) is provided. It is not wired in; it would need to be added to the nightly (or weekly, or...) GH Action, and further work would be needed to expose those files in the Django app (to provide access control, and possibly to limit access to Federal users). However, it has been tested locally and in preview to validate that the tooling works and is ready for inclusion in the data pipeline. (It only dumps tables from the public_data schema, thus guaranteeing that no suppressed data is processed.)

Additional documentation about the architectural changes is in the jadudm/api-perf branch (SQL_README.md).

Future work/possibilities

  • The data pipeline suggests that we could now generate Excel documents (e.g. SF-SACs) from the known public data in public_data, as opposed to filtering while generating from the internal tables.

Testing

First, pull the branch and do

make docker-full-clean ; make docker-first-run

followed by a docker compose up.

Watch the boot sequence

The boot sequence simulates both the daily boot sequence in production and the nightly GH Action work. That is, it runs all of the SQL involved in the data pipeline against DB2, then runs pre actions on DB1 (formerly api_teardown), then migrations, and then post actions on DB1 (formerly api_standup). The pre/post on DB1 is tiny compared to what it was before. The actions on DB2 are larger than what we used to have in api_standup/api_teardown, but that is because we are adding new functionality and improving what was there.

If things don't come up, it might be worth dropping a note to the dev channel with an @jadudm mention. This should be a clean sequence, as tested both locally and in preview. This PR does not introduce any changes to the DB1 data models, so there should be no data migration issues. (Is that true, @rnovak338? We might introduce a model change in an admin table...) No changes to the singleauditchecklist or the dissemination_ tables on DB1.

Confirm the app works

Try doing both a basic and an advanced search. If both work (and they will probably return zero results on an empty DB), you should have an app that is hitting DB1 for basic search and DB2 for advanced search.

Load a lot of data

In backend/util/load_public_dissem_data is a tool to help pre-populate your database with data. There is a README in that directory with detailed instructions. In short:

  1. Grab the bulk data from Gdrive. It is linked from the README. (This is public data that is already disseminated in the FAC. It has 100 records that are marked as is_public=false, but those are actually public records. In other words, it is all public data, and some of it is intentionally marked as suppressed for testing purposes.)
  2. Run make build and then make run to build and run the container.

You will need your full FAC stack running, and there is a note in the README about making sure that you are on the same virtual docker network. (That is, the containers all need to talk to each other.)

The data loading will take a few minutes. It is loading the dissemination_ tables in DB1 with around 300K general records and 4M federal_award records (and a bunch of other data in the other tables, too).
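
If you want to confirm the load worked before moving on, a quick row count against DB1 is enough. A minimal sketch, assuming local connection settings and Django-style table names (both are assumptions; adjust to your local stack):

# Hedged sanity check: count rows in the two largest dissemination_ tables.
# Connection settings and exact table names are assumptions for a local stack.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM dissemination_general")
    print("dissemination_general:", cur.fetchone()[0])
    cur.execute("SELECT COUNT(*) FROM dissemination_federalaward")
    print("dissemination_federalaward:", cur.fetchone()[0])
conn.close()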

Run the FAC again

Once you have loaded the data, bring your FAC stack down (docker compose down), and back up (docker compose up). This time, the simulation will take a lot longer. It will

  1. Copy all of the data from DB1
  2. Create a copy of that
  3. Create public and suppressed tables
  4. Stand up the API(s)
  5. Build a lot of indexes

The SQL_README has more information on that process. The indexing can be particularly slow on some machines.

Once the FAC is back up, you should be able to:

  1. Do basic searches
  2. Do advanced searches. You should find that an "all years" search with no parameters takes around 5-10 seconds, and an all-years search with ALN 93 returns much more quickly. Refining the search further should generally improve performance.

Test the API

Once you have data loaded, testing the API becomes both possible and interesting.

In backend/sql is test_api.py. (I don't know where it would be better to put this file. It can be moved.)

You should have, in your environment:

  • FAC_API_KEY, which, in truth, does not matter for local tests... but will if you want to test against preview
  • CYPRESS_API_GOV_USER_ID, which is the same ID you would have available for E2E testing with Cypress
  • CYPRESS_API_GOV_JWT, which (again) you should already have from E2E tests
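
Before running anything, it can save time to confirm those variables are actually set. A minimal pre-flight check (the variable names come from this PR; the script itself is just illustrative):

# Fail fast if any required API test variables are missing from the environment.
import os

REQUIRED = ("FAC_API_KEY", "CYPRESS_API_GOV_USER_ID", "CYPRESS_API_GOV_JWT")
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All API test environment variables are set.")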

Then, run the tests.

CAN_READ_SUPPRESSED=0 pytest -s --env local test_api.py

will run the test script with the environment set to "local" for all tests. Setting CAN_READ_SUPPRESSED=0 on the command line tells the suite that the key you are using does not have Tribal data access; therefore, the tests should expect access to fail when they hit the api_v2_0_0 endpoints that deal with suppressed data. (The --env flag is a custom pytest option; a sketch of how it is typically wired up follows.)
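
The --env flag is not built into pytest, so it is presumably registered by a conftest.py alongside test_api.py. A minimal sketch of how such an option is typically wired up (the fixture name and help text here are illustrative, not the actual code):

# conftest.py (sketch): registers the custom --env option used on the command line.
import pytest

def pytest_addoption(parser):
    # Default to local; "preview" can be passed for remote testing.
    parser.addoption("--env", action="store", default="local",
                     help="Target environment: local or preview")

@pytest.fixture
def env(request):
    # Tests can take `env` as a fixture to pick the right base URL.
    return request.config.getoption("--env")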

If you set CAN_READ_SUPPRESSED=1

CAN_READ_SUPPRESSED=1 pytest -s --env local test_api.py

you should be able to run again and get failing tests. That is because your key is not approved to access suppressed data!

Getting the expected results from local testing is a good thing, even if they are failing tests. Now, you want to grant Tribal access to your key and see whether the tests change.
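
To make the expected behavior concrete, here is a rough, hypothetical sketch of the kind of suppressed-data check test_api.py performs. The endpoint name, port, and header names are assumptions for illustration, not the actual test code:

# Hypothetical sketch of a suppressed-data access test; the URL, endpoint, and
# header names are assumptions, not the project's actual test code.
import os

import requests

API_BASE = "http://localhost:3000"  # assumed local PostgREST port

def test_suppressed_endpoint_respects_key():
    can_read = os.environ.get("CAN_READ_SUPPRESSED", "0") == "1"
    resp = requests.get(
        f"{API_BASE}/suppressed_notes",  # hypothetical api_v2_0_0 suppressed endpoint
        headers={
            "authorization": f"bearer {os.environ['CYPRESS_API_GOV_JWT']}",
            "x-api-user-id": os.environ["CYPRESS_API_GOV_USER_ID"],  # assumed header name
            "accept-profile": "api_v2_0_0",  # PostgREST schema-selection header
        },
        timeout=30,
    )
    if can_read:
        assert resp.status_code == 200
    else:
        assert resp.status_code in (401, 403)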

Testing in preview

The script can also be used to test in preview.

Preview currently has the large/public data set loaded, with the faked suppressed data. (PREVIEW DOES NOT CONTAIN SUPPRESSED DATA. IT CONTAINS FAKE SUPPRESSED DATA.) By changing the environment to preview, and still setting the env var appropriately, it is possible to run the pytest against the "production" environment, and confirm that access controls are in place there.
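
Independent of pytest, a hand-rolled request is a quick way to confirm the API is reachable from your machine. The hostname, endpoint, and header names below are assumptions, so adjust them to whatever you are actually testing against:

# Hypothetical spot check against preview; hostname, endpoint, and headers are assumptions.
import os

import requests

resp = requests.get(
    "https://api-preview.fac.gov/general",  # assumed preview API base URL and endpoint
    headers={
        "x-api-key": os.environ["FAC_API_KEY"],  # api.data.gov-style key header (assumed)
        "accept-profile": "api_v2_0_0",          # PostgREST schema-selection header
    },
    params={"limit": 5},
    timeout=30,
)
print(resp.status_code)
print(resp.json()[:2] if resp.ok else resp.text)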

You will need to grant access to your key in preview, which may require adding yourself to a file and restarting the app (so that the config change is read). Alternatively, you can edit the DB tables directly in order to effect the change.

Grant access to your key

In order to grant access to your key, you can either use the new admin panel, or edit the database directly.

To use the admin panel [RN EDIT HERE]...

Performance testing

There is an api_perf script floating around; it is relatively easy to inspect/modify and use against either the local stack or the preview site. It was primarily used to evaluate early design decisions in the API. It might be useful as another kind of check.
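
If you cannot find it, a minimal timing loop along these lines is a reasonable stand-in (the base URL and endpoint are assumptions; point them at whichever environment you are testing):

# Minimal, hypothetical latency check in the spirit of the api_perf script.
import os
import statistics
import time

import requests

BASE_URL = os.environ.get("PERF_BASE_URL", "http://localhost:3000")  # assumed default
ENDPOINT = "/general"  # hypothetical endpoint to exercise
RUNS = 5

timings = []
for _ in range(RUNS):
    start = time.perf_counter()
    requests.get(f"{BASE_URL}{ENDPOINT}", params={"limit": 100}, timeout=60)
    timings.append(time.perf_counter() - start)

print(f"median: {statistics.median(timings):.3f}s  max: {max(timings):.3f}s")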

PR Checklist: Submitter

  • Link to an issue if possible. If there’s no issue, describe what your branch does. Even if there is an issue, a brief description in the PR is still useful.
  • List any special steps reviewers have to follow to test the PR. For example, adding a local environment variable, creating a local test file, etc.
  • For extra credit, submit a screen recording like this one.
  • Make sure you’ve merged main into your branch shortly before creating the PR. (You should also be merging main into your branch regularly during development.)
  • Make sure you’ve accounted for any migrations. When you’re about to create the PR, bring up the application locally and then run git status | grep migrations. If there are any results, you probably need to add them to the branch for the PR. Your PR should have only one new migration file for each of the component apps, except in rare circumstances; you may need to delete some and re-run python manage.py makemigrations to reduce the number to one. (Also, unless in exceptional circumstances, your PR should not delete any migration files.)
  • Make sure that whatever feature you’re adding has tests that cover the feature. This includes test coverage to make sure that the previous workflow still works, if applicable.
  • Make sure the full-submission.cy.js Cypress test passes, if applicable.
  • Do manual testing locally. Our tests are not good enough yet to allow us to skip this step. If that’s not applicable for some reason, check this box.
  • Verify that no Git surgery was necessary, or, if it was necessary at any point, repeat the testing after it’s finished.
  • Once a PR is merged, keep an eye on it until it’s deployed to dev, and do enough testing on dev to verify that it deployed successfully, the feature works as expected, and the happy path for the broad feature area (such as submission) still works.
  • Ensure that prior to merging, the working branch is up to date with main and the terraform plan is what you expect.

PR Checklist: Reviewer

  • Pull the branch to your local environment and run make docker-clean; make docker-first-run && docker compose up; then run docker compose exec web /bin/bash -c "python manage.py test"
  • Manually test out the changes locally, or check this box to verify that it wasn’t applicable in this case.
  • Check that the PR has appropriate tests. Look out for changes in HTML/JS/JSON Schema logic that may need to be captured in Python tests even though the logic isn’t in Python.
  • Verify that no Git surgery is necessary at any point (such as during a merge party), or, if it was, repeat the testing after it’s finished.

The larger the PR, the stricter we should be about these points.

Pre Merge Checklist: Merger

  • Ensure that prior to approving, the terraform plan is what we expect it to be. -/+ resource "null_resource" "cors_header" should be destroying and recreating itself, and ~ resource "cloudfoundry_app" "clamav_api" might be updating its sha256 for the fac-file-scanner and fac-av-${ENV} by default.
  • Ensure that the branch is up to date with main.
  • Ensure that a terraform plan has been recently generated for the pull request.

jadudm and others added 30 commits September 27, 2024 09:45
This runs a full install sequence. Ready for testing on preview.
Adds the access tables to its own startup. Admin API relies on it, or it
won't start.
How did I miss those?
We should be able to inspect more tables, for debugging.
This brings up the full API, and is ready for performance testing in
`preview`.
It is cleaner than it was, and may make it easier to debug the `preview`
deploy.
Adds grabbing sling into the mix
Testing again
Also, added partitions.

Partitions are 3x faster for downloading all the data (by year), but 10x
slower than batches (via EXPLAIN). In real-world testing, it is 2x slower to
download than batches, but 3x faster than a straight download.

So. Probably worth it, if we can document that people downloading all
the data should either 1) do so year-by-year, or 2) use batches.
Need to just return a non-zero exit code...
Need to *always* succeed...
github-actions bot (Contributor) commented Nov 1, 2024

Terraform plan for meta

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

📝 Plan generated in Pull Request Checks #3931

github-actions bot (Contributor) commented Nov 1, 2024

Terraform plan for dev

Plan: 2 to add, 1 to change, 2 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
!~  update in-place
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # module.dev.cloudfoundry_app.postgrest will be updated in-place
!~  resource "cloudfoundry_app" "postgrest" {
!~      environment                     = (sensitive value)
        id                              = "5de6f04c-52e0-400f-961a-16243b1c7c9a"
!~      id_bg                           = "************************************" -> (known after apply)
        name                            = "postgrest"
#        (16 unchanged attributes hidden)

#        (1 unchanged block hidden)
    }

  # module.dev.cloudfoundry_service_key.postgrest must be replaced
-/+ resource "cloudfoundry_service_key" "postgrest" {
!~      credentials      = (sensitive value)
!~      id               = "************************************" -> (known after apply)
        name             = "postgrest"
!~      service_instance = "b036f306-5950-4078-9309-cfda6ed03482" -> "86a11021-7922-411b-b0ca-4341b7a0b911" # forces replacement
    }

  # module.dev.module.cors.null_resource.cors_header must be replaced
-/+ resource "null_resource" "cors_header" {
!~      id       = "*******************" -> (known after apply)
!~      triggers = { # forces replacement
!~          "always_run" = "2024-11-13T21:32:16Z" -> (known after apply)
        }
    }

Plan: 2 to add, 1 to change, 2 to destroy.

📝 Plan generated in Pull Request Checks #3931

Added max timeout to request
Now uses Django factory testing (like we do with our other tests that fall under `manage.py test`) for the tests that were written up by Matt.
INCOMPLETE - check the "FIXME" comment. Currently the environment is hardcoded to "local" under the `ApiTests` class, and I haven't gotten a "success" response from the following tests.
- `test_suppressed_not_accessible_with_bad_key`
- `test_suppressed_accessible_with_good_key`
@rnovak338 (Contributor) commented:

Adding some notes where I've left off from this PR today, if this gets picked up before I come back.

Before we perform any review or move this PR forward, we need to finish cleaning up the linting and the failed Django tests.

Bandit is reporting a few problems that just need a simple tweak to fix, but it is also reporting two more critical SQL injection vulnerabilities.

Also, the Python and accessibility tests are failing; I haven't had much time to crank through these.

CREATE VIEW api_v2_0_0.combined AS
SELECT * FROM public_data_v1_0_0.combined comb
;

A reviewer (Contributor) commented:

census_gsa_crosswalk view is missing here. This needs to be added.

Contributor (Author) replied:

Apologies; I still didn't get to that.

You could add it and do a PR against the codebase. However, it also needs to be added to the .sql in the... fac-snapshot-db/post folder.

Actually, I'd be happy to pair (Monday?) for 30m to help add this in. It would be faster, I suspect, for both of us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants