Fix Certifying Auditee field names in historical data #3402

sambodeme · 2024-02-13T23:54:45Z

While preparing the data migration documentation, it was noted that AUDITEENAME was incorrectly used instead of AUDITEECERTIFYNAME, and AUDITEETITLE was used instead of AUDITEECERTIFYTITLE. It was determined that this will have a low impact on the disseminated reports as it does not affect the financial aspect of the audit reports and only affects the auditee certifying information. However, because this still introduces some data inaccuracies in the reports in production, it must be addressed.
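To make the mix-up concrete, here is a minimal sketch of the corrected mapping. The four Census column names come from the description above. The function name, the dict-shaped row, the output keys, and the sample values are all assumptions made up for illustration, not the actual migration code or FAC schema.

```python
# Hypothetical sketch: a Census row is modeled as a plain dict keyed by
# column name. The output key names are illustrative only.

def certifying_auditee_fields(census_row: dict) -> dict:
    """Return the auditee certifying name and title from a Census row."""
    return {
        # The bug described above read AUDITEENAME here.
        "auditee_certify_name": census_row.get("AUDITEECERTIFYNAME", ""),
        # The bug described above read AUDITEETITLE here.
        "auditee_certify_title": census_row.get("AUDITEECERTIFYTITLE", ""),
    }


# Made-up sample row, only to show that the two pairs of columns differ.
row = {
    "AUDITEENAME": "City of Example",
    "AUDITEETITLE": "Finance Director",
    "AUDITEECERTIFYNAME": "Jane Doe",
    "AUDITEECERTIFYTITLE": "Chief Financial Officer",
}
fields = certifying_auditee_fields(row)
```

With the incorrect columns, the certifying name in this sample would have been the organization name rather than the person who certified.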

Possible solutions:

1. Correct the issue in the source records and re-run dissemination for all historic reports.
2. Fix the issue on the dissemination side (general table) following a change record approach.


danswick · 2024-04-10T18:12:30Z

My initial reaction is that option 1 seems preferable. Option 2 would be overwritten if we ever had to re-run dissemination for some other reason.

Another possible approach would be to handle both cases during intake-to-dissemination. We could check for both cases and pass the appropriate one on to the disseminated record.

jadudm · 2024-04-11T11:21:18Z

#1 is very expensive in time. Re-running a year of data on a single-core cloud.gov instance takes approximately two weeks.

A third option:

Bring the singleauditreport records in to the production tables.
Add administrative API coverage for that table.
Add an administrative API endpoint that allows for updating records (with appropriate change-management tracking, per @sambodeme's comment).
Add the ability to re-run dissemination on a record via API.

We could, in this way, begin doing data curation via API, eliminating some of the challenges of trying to do all of this as GH Actions.

I've just suggested an entirely new idea that needs discussion, I think. I'm wrestling with/thinking this way because:

We need a "source of truth" somewhere, which is our singleauditchecklist. We want the change made there, to avoid the problem @danswick pointed out.
We have a difficult time working with production data, because of our controls. (This is a good thing, but it is something that forces us to do work in lower environments and then manually move data in.)
We have controls built into the admin API that could let us "bless" one or more keys as "write" keys, and in doing so, limit the possibility of "accidents" against the source data.
The API can provide the logging and change tracking we need.
We have no easy way to re-disseminate a record, when it should be a cheap operation/easy to do.
We have multiple other (similar) issues we need to address, and we need to make data curation less expensive. The API moves data curation into the space of stand-alone Python (or, any language) tools that we can all test and validate against lower environments, and run with confidence against production.
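As a rough illustration of what a stand-alone curation tool might prepare before calling such an API, the sketch below builds a change-tracked update payload. Everything here is hypothetical: the `CurationChange` shape, the field names, the report ID, and the idea that the endpoint accepts JSON are assumptions, since the thread does not define the API. No network call is made.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class CurationChange:
    """One tracked edit to a source record (hypothetical shape)."""
    report_id: str
    field: str
    old_value: str
    new_value: str
    reason: str
    changed_at: str


def build_update_payload(report_id, field, old_value, new_value, reason, now=None):
    """Serialize a single field change, keeping the old value for the audit trail."""
    now = now or datetime.now(timezone.utc)
    change = CurationChange(
        report_id, field, old_value, new_value, reason, now.isoformat()
    )
    return json.dumps(asdict(change), sort_keys=True)


# Made-up values; a real tool would read these from the source tables.
payload = build_update_payload(
    report_id="EXAMPLE-REPORT-ID",
    field="auditee_certify_name",
    old_value="City of Example",
    new_value="Jane Doe",
    reason="issue #3402",
    now=datetime(2024, 4, 11, tzinfo=timezone.utc),
)
```

Keeping the payload builder as a pure function means it can be tested against lower environments without write keys, which is the property the list above is after.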

Probably a few other things. I have no idea how this would play with the existing migration tooling... it might be a non-starter. But, for ongoing curation work, this might be worth discussing?

danswick · 2024-07-24T17:51:19Z

We'll tackle this as part of the next batch of curation work.

sambodeme · 2024-10-28T15:52:02Z

As I began reviewing the code and scaffolding the necessary logic to fix the auditee name and title (see ticket #3402), I realized that data curation might be needed to address issues with historical records migrated from Census data. This could be due to various reasons, including bugs in the migration algorithm that were not identified at the time and are now surfacing (or may surface in the future). Additionally, there might be a need to update records in the FAC databases, regardless of their origin. This typically occurs when the FAC team modifies intake validation rules, resulting in existing records that no longer validate against the new rules without updates.

When data curation involves historical records, fixing these issues will often require accessing raw data from the historical Census records table and reusing logic from the census_historical_migration app. The reason for reusing this logic is to maintain consistency, such as the way we handled missing values during migration by replacing them with the GSA_MIGRATION placeholder.
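A minimal sketch of that placeholder rule follows. The `GSA_MIGRATION` value comes from the comment above. The helper name and the definition of "missing" (None, empty, or whitespace-only) are assumptions, and the real census_historical_migration logic may differ.

```python
GSA_MIGRATION = "GSA_MIGRATION"  # placeholder named in the comment above


def value_or_placeholder(raw):
    """Return the trimmed value, or the placeholder when it is missing.

    "Missing" is assumed here to mean None, empty, or whitespace-only.
    """
    if raw is None:
        return GSA_MIGRATION
    cleaned = str(raw).strip()
    return cleaned if cleaned else GSA_MIGRATION
```

Curation code that reuses one helper like this, rather than re-implementing the rule, would produce the same values the original migration did.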

This situation raises questions about how and where to organize the data curation code. Should we create a new app (data-curation) within the Django project and consolidate all data curation work there? This approach has the advantage of centralizing all data curation efforts in one place but may lead to the new app becoming too dependent on others, such as the census_historical_migration or audit apps.

Alternatively, should we include a curation section within each app (one for the census_historical_migration app and one for the audit app)? This approach would make the apps more self-contained and loosely coupled, reducing dependencies between them. However, it also means the curation logic would be spread across multiple apps.

jadudm · 2024-10-28T16:01:23Z

Thinking out loud...

The historical_migration code assumes:

The data coming in is formatted as a submission, and
It will need either transformation or permission to go through "incorrectly."

It would be heavyweight, but could all curation be implemented as

Generate a new submission with the updated/fixed data,
Pass it through the migration code

Such that the migration code is, for all intents and purposes, the only place we do this work?

(This is a third option. I haven't thought about how odd or heavyweight it might turn out to be.)

My intuition/assumptions so far have been that having a curation app would be the way to go.

For this particular issue, I've been assuming a management command would have access to both the sac and the census_historical tables. Therefore, the operation is basically:

1. Loop
2. Visit each SAC
3. Get the correct names from the historical tables
4. with curation_tracking:
   1. Make the change in the sac
   2. Save the sac
5. Re-disseminate the record

That is, I've been assuming we have 1) the current record and 2) the historical record in hand for all curation work, and therefore each action looks more like a management command that is probably only run once?
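The loop above could be sketched roughly as follows. This is not the code in the linked pull request. It uses plain dicts and lists in place of Django models so that it runs on its own, and `curation_tracking`, `redisseminate`, `fix_certifying_auditee`, and all key names and sample values are hypothetical stand-ins.

```python
from contextlib import contextmanager

audit_log = []  # stand-in for wherever curation events would be recorded


@contextmanager
def curation_tracking(report_id, reason):
    """Record that a curation edit started and whether it completed."""
    entry = {"report_id": report_id, "reason": reason, "status": "started"}
    audit_log.append(entry)
    try:
        yield entry
        entry["status"] = "completed"
    except Exception:
        entry["status"] = "failed"
        raise


def redisseminate(sac, disseminated):
    """Stand-in for re-running dissemination on a single record."""
    disseminated[sac["report_id"]] = {
        "auditee_certify_name": sac["auditee_certify_name"],
        "auditee_certify_title": sac["auditee_certify_title"],
    }


def fix_certifying_auditee(sacs, historical, disseminated):
    """Visit each SAC, copy the correct names, then re-disseminate."""
    for sac in sacs:
        source = historical.get(sac["report_id"])
        if source is None:
            continue  # no historical row, so nothing to correct
        with curation_tracking(sac["report_id"], "issue #3402"):
            sac["auditee_certify_name"] = source["AUDITEECERTIFYNAME"]
            sac["auditee_certify_title"] = source["AUDITEECERTIFYTITLE"]
            # A real management command would save the model here.
        redisseminate(sac, disseminated)


# Made-up data: one migrated record carrying the wrong values.
sacs = [{
    "report_id": "EXAMPLE-REPORT-ID",
    "auditee_certify_name": "City of Example",
    "auditee_certify_title": "Finance Director",
}]
historical = {"EXAMPLE-REPORT-ID": {
    "AUDITEECERTIFYNAME": "Jane Doe",
    "AUDITEECERTIFYTITLE": "Chief Financial Officer",
}}
disseminated = {}
fix_certifying_auditee(sacs, historical, disseminated)
```

Re-dissemination sits outside the tracking block so that the tracked unit is only the edit to the source record. Whether it belongs inside is a design choice the thread leaves open.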

sambodeme mentioned this issue Feb 13, 2024

📍 Next steps in data migration #3364

Closed

sambodeme added the eng label Feb 13, 2024

github-project-automation bot added this to FAC Feb 13, 2024

github-project-automation bot moved this to Triage in FAC Feb 13, 2024

danswick moved this from Triage to Available in FAC Apr 10, 2024

danswick changed the title ~~Fix Certifying Auditee Info~~ Fix Certifying Auditee field names in historical data Apr 10, 2024

sambodeme self-assigned this Oct 22, 2024

sambodeme moved this from Backlog to In Progress in FAC Oct 23, 2024

sambodeme linked a pull request Oct 25, 2024 that will close this issue

Added logic to fix auditee name and title #4418

Open

18 tasks

sambodeme added a commit that referenced this issue Oct 30, 2024

#3402 Restructured code and added test

47f5f25

sambodeme added the curation Candidate for data curation label Dec 9, 2024
