
DRAFT Read replication and moving the API to a second database #4427

Draft · wants to merge 96 commits into main
Conversation

@jadudm (Contributor) commented Nov 1, 2024

Still in progress. This message will be removed, and the PR moved out of draft, when it is ready. Saving incrementally.

This PR represents work from @asteel-gsa , @rnovak338 , @jadudm , and has benefited greatly from conversation with many members of the team.

This is a large PR.

DB1 = fac-db
DB2 = fac-snapshot-db

Changes

  • The API is removed from DB1
  • The old Census data used for cog/over assignments is removed, as we now use our internal/cleaned data for this purpose (and have for a long time)
  • Advanced search now points at DB2
  • A stateless/reproducible data pipeline is introduced in DB2 (regenerated nightly via a GH Action)
  • api_v1_0_3 is removed, as it was suppressing all Tribal data, and therefore arguably incorrect.
  • admin_api_v... is removed. No administrative access happens via the API. This also removes an entire access control mechanism. 🎉 (A good thing for maintenance and compliance.)
  • api_v1_1_0 is the new default API.
  • api_v2_0_0 is introduced
  • Separate tables on separate schemas are introduced as part of the pipeline: public_data_v1_0_0 and suppressed_data_v1_0_0. Only public data lives in the public schema, and only suppressed data (e.g., audits from Tribes/Tribal Organizations that have opted to suppress their data) lives in the suppressed schema.
  • api_v2_0_0 uses no access control on data in the public schema, and applies access control to the data in the suppressed schema.
  • A new mechanism for adding/removing privileged users in the Django Admin dashboard is introduced. It is now managed via a static JSON document in the config directory, and is handled via onboarding/offboarding just like the cloud.gov access.
  • A new mechanism for adding/removing access to suppressed reports for Federal users is introduced. This is now handled via the Django admin panel.
  • A new tool for loading bulk data is offered; it is entirely for local/dev use.
  • sling is introduced into the environment; it is a tool for moving data DB->DB, DB->S3, S3->DB, etc. A pilot/prototype for generating full dumps of the database to CSV (direct to S3) is provided. It is not wired in; it would need to be added to the nightly (or weekly, or...) GH Action, and further work would be needed to expose those files in the Django app (to provide access control, and possibly to limit access to Federal users). However, it has been tested locally and in preview to validate that the tooling works and is ready for inclusion in the data pipeline. (It only dumps tables from the public_data schema, thus guaranteeing that no suppressed data is processed.)

Additional documentation about the architectural changes is in the jadudm/api-perf branch (SQL_README.md).

Future work/possibilities

  • The data pipeline suggests that we could now generate Excel documents (e.g. SF-SACs) from the known public data in public_data, as opposed to filtering while generating from the internal tables.

Testing

First, pull the branch and do

make docker-full-clean ; make docker-first-run

followed by a docker compose up.

Watch the boot sequence

The boot sequence simulates both the daily boot sequence in production and the nightly GH Action work. That is, it runs all of the SQL involved in the data pipeline against DB2, then runs pre actions on DB1 (formerly api_teardown), then migrations, and then post actions on DB1 (formerly api_standup). The pre/post on DB1 is tiny compared to what it was before. The actions on DB2 are larger than what we used to have in api_standup/api_teardown, but that is because we are adding new functionality and improving what was there.

If things don't come up, it might be worth dropping a note to the dev channel with an @jadudm mention. This should be a clean sequence, as tested both locally and in preview. This PR does not introduce any changes to the DB1 data models, so there should be no data migration issues. (Is that true, @rnovak338? We might introduce a model change in an admin table...) No changes to the singleauditchecklist or the dissemination_ tables on DB1.

Confirm the app works

Try doing both a basic and an advanced search. If both work (and they will probably return zero results on an empty DB), you should have an app that is hitting DB1 for basic search and DB2 for advanced search.

Load a lot of data

In backend/util/load_public_dissem_data is a tool to help pre-populate your database with data. There is a README in that directory with detailed instructions. In short:

  1. Grab the bulk data from Gdrive. It is linked from the README. (This is public data that is already disseminated in the FAC. It has 100 records that are marked as is_public=false, but those are actually public records. In other words, it is all public data, and some of it is intentionally marked as suppressed for testing purposes.)
  2. Run make build and then make run to build and run the container.

You will need your full FAC stack running, and there is a note in the README about making sure that you are on the same virtual docker network. (That is, the containers all need to talk to each other.)

The data loading will take a few minutes. It is loading the dissemination_ tables in DB1 with around 300K general records and 4M federal_award records (and a bunch of other data in the other tables, too).
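
If you want to confirm the load worked before moving on, a quick row count against DB1 is enough. A minimal sketch, assuming local connection settings and Django-style table names (both are assumptions; adjust to your local stack):

# Hedged sanity check: count rows in the two largest dissemination_ tables.
# Connection settings and exact table names are assumptions for a local stack.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM dissemination_general")
    print("dissemination_general:", cur.fetchone()[0])
    cur.execute("SELECT COUNT(*) FROM dissemination_federalaward")
    print("dissemination_federalaward:", cur.fetchone()[0])
conn.close()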

Run the FAC again

Once you have loaded the data, bring your FAC stack down (docker compose down), and back up (docker compose up). This time, the simulation will take a lot longer. It will

  1. Copy all of the data from DB1
  2. Create a copy of that
  3. Create public and suppressed tables
  4. Stand up the API(s)
  5. Build a lot of indexes

The SQL_README has more information on that process. The indexing can be particularly slow on some machines.

Once the FAC is back up, you should be able to:

  1. Do basic searches
  2. Do advanced searches. You should find that an "all years" search with no parameters takes around 5-10 seconds, and an all-years search with ALN 93 returns much more quickly. Refining the search further should generally improve performance.

Test the API

Once you have data loaded, testing the API becomes both possible and interesting.

In backend/sql is test_api.py. (I don't know where it would be better to put this file. It can be moved.)

You should have, in your environment:

  • FAC_API_KEY, which, in truth, does not matter for local tests... but will if you want to test against preview
  • CYPRESS_API_GOV_USER_ID, which is the same ID you would have available for E2E testing with Cypress
  • CYPRESS_API_GOV_JWT, which (again) you should already have from E2E tests
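
Before running anything, it can save time to confirm those variables are actually set. A minimal pre-flight check (the variable names come from this PR; the script itself is just illustrative):

# Fail fast if any required API test variables are missing from the environment.
import os

REQUIRED = ("FAC_API_KEY", "CYPRESS_API_GOV_USER_ID", "CYPRESS_API_GOV_JWT")
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All API test environment variables are set.")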

Then, run the tests.

CAN_READ_SUPPRESSED=0 pytest -s --env local test_api.py

will run the test script with the environment set to "local" for all tests. Setting CAN_READ_SUPPRESSED=0 on the command line tells the suite that the key you are using does not have Tribal data access; therefore, the tests should expect access to fail when they hit the api_v2_0_0 endpoints that deal with suppressed data. (The --env flag is a custom pytest option; a sketch of how it is typically wired up follows.)
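
The --env flag is not built into pytest, so it is presumably registered by a conftest.py alongside test_api.py. A minimal sketch of how such an option is typically wired up (the fixture name and help text here are illustrative, not the actual code):

# conftest.py (sketch): registers the custom --env option used on the command line.
import pytest

def pytest_addoption(parser):
    # Default to local; "preview" can be passed for remote testing.
    parser.addoption("--env", action="store", default="local",
                     help="Target environment: local or preview")

@pytest.fixture
def env(request):
    # Tests can take `env` as a fixture to pick the right base URL.
    return request.config.getoption("--env")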

If you set CAN_READ_SUPPRESSED=1

CAN_READ_SUPPRESSED=1 pytest -s --env local test_api.py

you should be able to run again and get failing tests. That is because your key is not approved to access suppressed data!

Getting the expected results from local testing is a good thing, even if they are failing tests. Now, you want to grant Tribal access to your key and see whether the tests change.
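
To make the expected behavior concrete, here is a rough, hypothetical sketch of the kind of suppressed-data check test_api.py performs. The endpoint name, port, and header names are assumptions for illustration, not the actual test code:

# Hypothetical sketch of a suppressed-data access test; the URL, endpoint, and
# header names are assumptions, not the project's actual test code.
import os

import requests

API_BASE = "http://localhost:3000"  # assumed local PostgREST port

def test_suppressed_endpoint_respects_key():
    can_read = os.environ.get("CAN_READ_SUPPRESSED", "0") == "1"
    resp = requests.get(
        f"{API_BASE}/suppressed_notes",  # hypothetical api_v2_0_0 suppressed endpoint
        headers={
            "authorization": f"bearer {os.environ['CYPRESS_API_GOV_JWT']}",
            "x-api-user-id": os.environ["CYPRESS_API_GOV_USER_ID"],  # assumed header name
            "accept-profile": "api_v2_0_0",  # PostgREST schema-selection header
        },
        timeout=30,
    )
    if can_read:
        assert resp.status_code == 200
    else:
        assert resp.status_code in (401, 403)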

Testing in preview

The script can also be used to test in preview.

Preview currently has the large/public data set loaded, with the faked suppressed data. (PREVIEW DOES NOT CONTAIN SUPPRESSED DATA. IT CONTAINS FAKE SUPPRESSED DATA.) By changing the environment to preview, and still setting the env var appropriately, it is possible to run the pytest against the "production" environment, and confirm that access controls are in place there.
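
Independent of pytest, a hand-rolled request is a quick way to confirm the API is reachable from your machine. The hostname, endpoint, and header names below are assumptions, so adjust them to whatever you are actually testing against:

# Hypothetical spot check against preview; hostname, endpoint, and headers are assumptions.
import os

import requests

resp = requests.get(
    "https://api-preview.fac.gov/general",  # assumed preview API base URL and endpoint
    headers={
        "x-api-key": os.environ["FAC_API_KEY"],  # api.data.gov-style key header (assumed)
        "accept-profile": "api_v2_0_0",          # PostgREST schema-selection header
    },
    params={"limit": 5},
    timeout=30,
)
print(resp.status_code)
print(resp.json()[:2] if resp.ok else resp.text)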

You will need to grant access to your key in preview, which may require adding yourself to a file and restarting the app (so that the config change is read). Alternatively, you can edit the DB tables directly in order to effect the change.

Grant access to your key

In order to grant access to your key, you can either use the new admin panel, or edit the database directly.

To use the admin panel [RN EDIT HERE]...

Performance testing

There is an api_perf script floating around; it is relatively easy to inspect/modify and use against either the local stack or the preview site. It was primarily used to evaluate early design decisions in the API. It might be useful as another kind of check.
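
If you cannot find it, a minimal timing loop along these lines is a reasonable stand-in (the base URL and endpoint are assumptions; point them at whichever environment you are testing):

# Minimal, hypothetical latency check in the spirit of the api_perf script.
import os
import statistics
import time

import requests

BASE_URL = os.environ.get("PERF_BASE_URL", "http://localhost:3000")  # assumed default
ENDPOINT = "/general"  # hypothetical endpoint to exercise
RUNS = 5

timings = []
for _ in range(RUNS):
    start = time.perf_counter()
    requests.get(f"{BASE_URL}{ENDPOINT}", params={"limit": 100}, timeout=60)
    timings.append(time.perf_counter() - start)

print(f"median: {statistics.median(timings):.3f}s  max: {max(timings):.3f}s")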

PR Checklist: Submitter

  • Link to an issue if possible. If there’s no issue, describe what your branch does. Even if there is an issue, a brief description in the PR is still useful.
  • List any special steps reviewers have to follow to test the PR. For example, adding a local environment variable, creating a local test file, etc.
  • For extra credit, submit a screen recording like this one.
  • Make sure you’ve merged main into your branch shortly before creating the PR. (You should also be merging main into your branch regularly during development.)
  • Make sure you’ve accounted for any migrations. When you’re about to create the PR, bring up the application locally and then run git status | grep migrations. If there are any results, you probably need to add them to the branch for the PR. Your PR should have only one new migration file for each of the component apps, except in rare circumstances; you may need to delete some and re-run python manage.py makemigrations to reduce the number to one. (Also, unless in exceptional circumstances, your PR should not delete any migration files.)
  • Make sure that whatever feature you’re adding has tests that cover the feature. This includes test coverage to make sure that the previous workflow still works, if applicable.
  • Make sure the full-submission.cy.js Cypress test passes, if applicable.
  • Do manual testing locally. Our tests are not good enough yet to allow us to skip this step. If that’s not applicable for some reason, check this box.
  • Verify that no Git surgery was necessary, or, if it was necessary at any point, repeat the testing after it’s finished.
  • Once a PR is merged, keep an eye on it until it’s deployed to dev, and do enough testing on dev to verify that it deployed successfully, the feature works as expected, and the happy path for the broad feature area (such as submission) still works.
  • Ensure that prior to merging, the working branch is up to date with main and the terraform plan is what you expect.

PR Checklist: Reviewer

  • Pull the branch to your local environment and run make docker-clean; make docker-first-run && docker compose up; then run docker compose exec web /bin/bash -c "python manage.py test"
  • Manually test out the changes locally, or check this box to verify that it wasn’t applicable in this case.
  • Check that the PR has appropriate tests. Look out for changes in HTML/JS/JSON Schema logic that may need to be captured in Python tests even though the logic isn’t in Python.
  • Verify that no Git surgery is necessary at any point (such as during a merge party), or, if it was, repeat the testing after it’s finished.

The larger the PR, the stricter we should be about these points.

Pre Merge Checklist: Merger

  • Ensure that prior to approving, the terraform plan is what we expect it to be. -/+ resource "null_resource" "cors_header" should be destroying and recreating itself, and ~ resource "cloudfoundry_app" "clamav_api" might be updating its sha256 for the fac-file-scanner and fac-av-${ENV} by default.
  • Ensure that the branch is up to date with main.
  • Ensure that a terraform plan has been recently generated for the pull request.

jadudm and others added 30 commits September 27, 2024 09:45
This runs a full install sequence. Ready for testing on preview.
Adds the access tables to its own startup. Admin API relies on it, or it
won't start.
How did I miss those?
We should be able to inspect more tables, for debugging.
This brings up the full API, and is ready for performance testing in
`preview`.
It is cleaner than it was, and may make it easier to debug the `preview`
deploy.
Adds grabbing sling into the mix
Testing again
Also, added partitions.

Partitions are 3x faster for downloading all the data (by year), but 10x
slower than batches (via EXPLAIN). In real-world testing, it is 2x slower to
download than batches, but 3x faster than a straight download.

So. Probably worth it, if we can document that people downloading all
the data should either 1) do so year-by-year, or 2) use batches.
Need to just return a non-zero exit code...
Need to *always* succeed...
github-actions bot (Contributor) commented Nov 1, 2024

Terraform plan for meta

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

📝 Plan generated in Pull Request Checks #3931

github-actions bot (Contributor) commented Nov 1, 2024

Terraform plan for dev

Plan: 2 to add, 1 to change, 2 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
!~  update in-place
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # module.dev.cloudfoundry_app.postgrest will be updated in-place
!~  resource "cloudfoundry_app" "postgrest" {
!~      environment                     = (sensitive value)
        id                              = "5de6f04c-52e0-400f-961a-16243b1c7c9a"
!~      id_bg                           = "************************************" -> (known after apply)
        name                            = "postgrest"
#        (16 unchanged attributes hidden)

#        (1 unchanged block hidden)
    }

  # module.dev.cloudfoundry_service_key.postgrest must be replaced
-/+ resource "cloudfoundry_service_key" "postgrest" {
!~      credentials      = (sensitive value)
!~      id               = "************************************" -> (known after apply)
        name             = "postgrest"
!~      service_instance = "b036f306-5950-4078-9309-cfda6ed03482" -> "86a11021-7922-411b-b0ca-4341b7a0b911" # forces replacement
    }

  # module.dev.module.cors.null_resource.cors_header must be replaced
-/+ resource "null_resource" "cors_header" {
!~      id       = "*******************" -> (known after apply)
!~      triggers = { # forces replacement
!~          "always_run" = "2024-11-13T21:32:16Z" -> (known after apply)
        }
    }

Plan: 2 to add, 1 to change, 2 to destroy.

📝 Plan generated in Pull Request Checks #3931

Added max timeout to request
Now uses Django factory testing (like we do with our other tests that fall under `manage.py test`) for the tests that were written up by Matt.
INCOMPLETE - check the "FIXME" comment. Currently the environment is hardcoded to "local" under the `ApiTests` class, and I haven't gotten a "success" response from the following tests.
- `test_suppressed_not_accessible_with_bad_key`
- `test_suppressed_accessible_with_good_key`
@rnovak338 (Contributor) commented:

Adding some notes where I've left off from this PR today, if this gets picked up before I come back.

Before we perform any review or move this PR forward, we need to finish cleaning up the linting and the failed Django tests.

Bandit is reporting a few problems that just need a simple tweak to fix, but it is also reporting two more critical SQL injection vulnerabilities.

Also, the Python and accessibility tests are failing; I haven't had much time to crank through these.

CREATE VIEW api_v2_0_0.combined AS
SELECT * FROM public_data_v1_0_0.combined comb
;

A reviewer (Contributor) commented:

census_gsa_crosswalk view is missing here. This needs to be added.

Contributor (Author) replied:

Apologies; I still didn't get to that.

You could add it and do a PR against the codebase. However, it also needs to be added to the .sql in the... fac-snapshot-db/post folder.

Actually, I'd be happy to pair (Monday?) for 30m to help add this in. It would be faster, I suspect, for both of us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants