Feature #1/inep data scraper #70

Merged 48 commits on Sep 4, 2024

johan-rocha (Collaborator) commented Sep 4, 2024

Related Issue

Resolves #1 #42 #44 #46 #67

Description

This pull request addresses several issues related to data processing and web scraping. The key changes involve implementing a web scraping process for educational data from INEP, managing data consistency and checkpointing, generalizing data analysis, and automating database insertion based on specific dates. These changes span multiple modules and files in the DataETL and DataScraper directories.
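The checkpointing mentioned above is not shown in this description. As a rough illustration only, a resumable scrape loop could look like the sketch below. The names `load_checkpoint`, `save_checkpoint`, `scrape_all`, and the JSON checkpoint format are assumptions made for this example and are not taken from `InepScraper.py`.

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path):
    """Return the set of item codes already processed (empty if no file yet)."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text(encoding="utf-8")))


def save_checkpoint(path, done):
    """Write to a temp file, then rename, so an interrupted run cannot
    leave a half-written checkpoint behind."""
    p = Path(path)
    fd, tmp = tempfile.mkstemp(dir=p.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, p)


def scrape_all(codes, fetch, checkpoint_path):
    """Call fetch(code) for every code not yet in the checkpoint.

    `fetch` is a hypothetical callable standing in for the real request
    and parsing logic. Returns only the results produced by this run.
    """
    done = load_checkpoint(checkpoint_path)
    results = {}
    for code in codes:
        if code in done:
            continue  # already scraped in a previous run
        results[code] = fetch(code)
        done.add(code)
        save_checkpoint(checkpoint_path, done)  # persist after each item
    return results
```

Saving after every item trades some I/O for the ability to resume exactly where a failed run stopped. A real scraper might batch the writes instead.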

Proposed Changes

  • DataETL

    • ETLDebugger.ipynb: Updated for debugging new ETL processes.
    • ETLIndicadorPostgres.py: Implemented processes for loading and managing indicators in PostgreSQL.
    • ETLMatriculaPostgres.py: Added functionality to handle matricula data in PostgreSQL.
    • ETLMunicipioJSON.py: Enhanced JSON extraction for municipalities.
    • ETLMunicipioPostgres.py: Added functionality for managing municipality data in PostgreSQL.
    • main.py: Updated to integrate new ETL processes and modules.
    • models/base.py: Updated base models to support new data structures.
    • models/filtro.py: Modified to include new filters for data processing.
    • models/indicador.py: Implemented changes for handling indicator data.
    • models/matricula.py: Updated to reflect new matricula data handling.
    • models/municipio.py: Modified to include new municipality data handling.
    • municipio_id.py: Enhanced to support new municipality ID processing.
    • src/config.py: Updated configuration settings for new ETL processes.
    • src/connect.py: Enhanced connection handling for database interactions.
    • src/etl.py: Updated to reflect new ETL process implementations.
    • src/extractor.py: Modified to include new data extraction methods.
    • src/load.py: Updated to handle new data loading requirements.
    • src/transform.py: Enhanced transformation methods to support new data structures.
  • DataScraper

    • InepScraper.py: Developed a web scraping script to extract educational data from INEP for Minas Gerais and its municipalities.

Tests Performed

  • Validated data collection from INEP against official sources for accuracy and consistency.
  • Tested web scraping script across various scenarios to ensure effectiveness.
  • Conducted performance tests and consistency checks for checkpoint functionality.
  • Verified data processing and generalization between 'matricula' and 'indicador' classes.
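The last check refers to generalizing the 'matricula' and 'indicador' classes, but this description does not show how that is done. One possible shape, purely as an assumption, is a shared base record that both types extend so a single loading routine can handle either. Every class, field, and function name below is hypothetical and is not taken from `models/`.

```python
from dataclasses import dataclass, asdict


@dataclass
class Record:
    """Hypothetical base: fields shared by both record types."""
    municipio_id: int
    year: int
    value: float

    table = ""  # unannotated, so a plain class attribute, not a dataclass field

    def to_row(self):
        # Only dataclass fields are included; `table` is left out.
        return asdict(self)


@dataclass
class Matricula(Record):
    stage: str = ""
    table = "matricula"


@dataclass
class Indicador(Record):
    name: str = ""
    table = "indicador"


def group_by_table(records):
    """Group rows by target table so one loader can insert both types."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec.table, []).append(rec.to_row())
    return grouped
```

With this layout, code that loads rows only depends on `table` and `to_row()`, so adding another record type would not require changes to the loader.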

Review Checklist

  • Are the changes adequately documented?
  • Have tests been executed and passed successfully?
  • Are the changes aligned with the related issue?

@johan-rocha added the design (design modifications), enhancement (general modifications), feature (new feature), and performance (performance improvement) labels Sep 4, 2024
@johan-rocha added this to the Sprint #15 milestone Sep 4, 2024
@johan-rocha self-assigned this Sep 4, 2024
@rafgpereira merged commit b556943 into develop Sep 4, 2024
1 check passed

codecov bot commented Sep 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (135dbd2) to head (cb82285).
Report is 49 commits behind head on develop.

Additional details and impacted files
@@            Coverage Diff            @@
##           develop       #70   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            3         3           
  Lines          147       147           
  Branches        39        39           
=========================================
  Hits           147       147           


Projects
Status: Done

2 participants