Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

98 hivemind refactor GitHub issue extraction functions to object orientated code #107

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

JacobCWBillings
Copy link

@JacobCWBillings JacobCWBillings commented Apr 4, 2024

Summary by CodeRabbit

  • Refactor
    • Enhanced the GitHub issues extraction process by introducing a more structured approach with the GithubExtraction and GithubIssueExtraction classes.
    • Improved testing methodology for GitHub issues fetching to align with the new extraction classes.
    • Restructured imports and method calls for improved code organization and readability.
    • Added a new class GitHubIssueID for handling specific attributes related to GitHub issues.
    • Updated test cases for fetching GitHub issue IDs and raw issues to accommodate changes in the extraction process.

Copy link
Contributor

coderabbitai bot commented Apr 4, 2024

Walkthrough

The update shifts the GitHub ETL process towards object-oriented design, introducing new classes for extracting GitHub data like issues and commits. This restructuring enhances code organization and maintainability, with functions now encapsulated within specific classes for better structure and testability.

Changes

Files Change Summary
github_etl.py, test_github_etl_fetch_issues.py Replaced fetch_issues with methods from GithubExtraction class; test updates.
extract/__init__.py, extract/issues.py Introduced GithubIssueExtraction class and added issue processing methods.
test_github_etl_fetch_raw_issues.py Refactored tests to utilize GithubExtraction class for fetching raw issues.

Possibly related issues

🐇💻🌟
In the realm of code and logic deep,
Where functions once did play,
Classes now in harmony leap,
Refactoring the old way.
🌳📚🐾
Through GitHub's fields we tread,
Data structures anew,
Evolving with each line we've read,
Our codebase ever true.
🌈🐇💾


Recent Review Details

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 3278cd5 and 6efc3f7.
Files selected for processing (8)
  • dags/hivemind_etl_helpers/github_etl.py (2 hunks)
  • dags/hivemind_etl_helpers/src/db/github/extract/init.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/github/extract/github_extraction.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/github/extract/issues.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/github/schema/init.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/github/schema/issue.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_issue_ids.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_issues.py (2 hunks)
Additional comments not posted (7)
dags/hivemind_etl_helpers/src/db/github/schema/__init__.py (1)

4-4: Ensure that the GitHubIssueID class is correctly utilized wherever GitHubIssue is used, given that it's now part of the exported entities.

dags/hivemind_etl_helpers/src/db/github/extract/__init__.py (1)

4-5: The import changes correctly reflect the transition to an object-oriented approach. Ensure that the removed fetch_issues function is no longer used in the codebase.

dags/hivemind_etl_helpers/src/db/github/extract/github_extraction.py (1)

8-32: The GithubExtraction class is well-structured and encapsulates the issue extraction logic effectively. Ensure that the methods _fetch_raw_issues and fetch_issues are used correctly across the codebase.

Verification successful

The verification process has confirmed that the methods _fetch_raw_issues and fetch_issues from the GithubExtraction class are being used across the codebase, including in integration tests and the main ETL script. This suggests that these methods are integral to the system's functionality and are likely being tested for correctness and reliability.

  • Usage in Integration Tests: Both methods are used in multiple test scenarios, which helps ensure their reliability and correctness.
  • Usage in ETL Script: The fetch_issues method is also used in the github_etl.py script, indicating its importance in the actual data extraction process.
Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the usage of `_fetch_raw_issues` and `fetch_issues` methods in the codebase.
rg --files-with-matches --type py "GithubExtraction" | xargs rg "_fetch_raw_issues"
rg --files-with-matches --type py "GithubExtraction" | xargs rg "fetch_issues"

Length of output: 1712

dags/hivemind_etl_helpers/src/db/github/schema/issue.py (1)

68-84: The GitHubIssueID class is correctly implemented as a subclass of GitHubIssue, focusing on the id attribute. Ensure that this class is used appropriately wherever issue IDs are specifically required.

Verification successful

The verification process has confirmed that the GitHubIssueID class is correctly and effectively used in the codebase, particularly in scenarios where only the ID of GitHub issues is required. The class is utilized in issues.py to extract and manage lists of issue IDs, which aligns with its intended purpose.

  • dags/hivemind_etl_helpers/src/db/github/extract/issues.py: The class is used to create and manage lists of GitHub issue IDs, which is appropriate for the context.
Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the usage of `GitHubIssueID` in the codebase, especially in contexts where only the ID is needed.
rg --files-with-matches --type py "GitHubIssue" | xargs rg "GitHubIssueID"

Length of output: 947

dags/hivemind_etl_helpers/github_etl.py (1)

64-70: The modification to use GithubExtraction for fetching issues is correctly implemented. Ensure that the old fetch_issues function is fully deprecated and not used elsewhere in the codebase.

Verification successful

---

The examination of the code snippets where `fetch_issues` is referenced reveals the following:

1. **github_etl.py**: The function `fetch_issues` is not directly called; instead, the method `fetch_issues` from the `GithubExtraction` class is used. This aligns with the intended object-oriented approach.

2. **test_github_etl_fetch_issues.py**: The tests are using the `fetch_issues` method from the `GithubExtraction` class, which is appropriate and does not indicate outdated usage.

3. **github_extraction.py and issues.py**: These files define and use the `fetch_issues` method within the context of their respective classes (`GithubExtraction` and a class likely defined in `issues.py`). This usage is consistent with an object-oriented design and does not suggest deprecation is needed.

4. **github_api_helpers/issues.py**: This file contains a standalone function `fetch_issues` that fetches issues for a specific repository page by page. This function appears to be part of a different module (`github_api_helpers`) and serves a different purpose, likely used for direct API interactions rather than the ETL process.

Based on this analysis, the references to `fetch_issues` across the codebase are appropriate and reflect either method calls from class instances or function calls in a contextually different module. There is no indication that the old `fetch_issues` function, if it was a standalone function in the `github_etl.py` context, is still being used inappropriately.

**Conclusion:**
The transition to using the `GithubExtraction` class for fetching issues in the `github_etl.py` file is correctly implemented. The remaining references to `fetch_issues` in the codebase are valid and do not require deprecation. They are either method calls from new class instances or function calls in a separate module context.



</details>

<details>
<summary>Scripts Executed</summary>

The following scripts were executed for the analysis:



---

Script:
```shell
#!/bin/bash
# Search for any remaining references to the old `fetch_issues` function.
rg --files-with-matches --type py "fetch_issues"

Length of output: 332



Script:

#!/bin/bash
# Extract relevant code snippets from files where `fetch_issues` is mentioned to understand the context of its usage.

# Extract from github_etl.py
echo "=== github_etl.py ==="
rg "fetch_issues" dags/hivemind_etl_helpers/github_etl.py --context 5

# Extract from test_github_etl_fetch_issues.py
echo "=== test_github_etl_fetch_issues.py ==="
rg "fetch_issues" dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_issues.py --context 5

# Extract from github_extraction.py
echo "=== github_extraction.py ==="
rg "fetch_issues" dags/hivemind_etl_helpers/src/db/github/extract/github_extraction.py --context 5

# Extract from issues.py
echo "=== issues.py ==="
rg "fetch_issues" dags/hivemind_etl_helpers/src/db/github/extract/issues.py --context 5

# Extract from issues.py in github_api_helpers
echo "=== github_api_helpers/issues.py ==="
rg "fetch_issues" dags/github/github_api_helpers/issues.py --context 5

Length of output: 4128

dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_issues.py (1)

18-20: The integration tests are correctly updated to use the _fetch_raw_issues method from the GithubExtraction class. Ensure that these tests cover all relevant cases and that the method is robustly tested.

Verification successful

The verification process confirms that the _fetch_raw_issues method is well-covered by tests in the test_github_etl_fetch_raw_issues.py file. The tests include scenarios such as getting empty results and fetching single issues, which suggests a robust testing approach for this method. Additionally, the method's implementation is referenced in the issues.py file, aligning with the expected usage in the tests.

  • dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_issues.py: Contains multiple test cases for _fetch_raw_issues.
  • dags/hivemind_etl_helpers/src/db/github/extract/issues.py: Implementation of _fetch_raw_issues.
Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that the `_fetch_raw_issues` method is covered by tests in other parts of the codebase.
rg --files-with-matches --type py "_fetch_raw_issues" | xargs rg "test_"

Length of output: 727

dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_issue_ids.py (1)

18-20: The integration tests are correctly updated to use the fetch_issue_ids method from the GithubExtraction class. Ensure that these tests cover all relevant cases and that the method is robustly tested.

Verification successful

The integration tests for the fetch_issue_ids method in the GithubExtraction class appear to be comprehensive. They cover various scenarios including no data, single issue retrieval, and multiple issues retrieval, which ensures that the method is robustly tested across different conditions. This aligns with the expectations set in the review comment.

  • File and tests:
    • dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_issue_ids.py
      • test_get_empty_results_no_from_date
      • test_get_empty_results
      • test_get_single_issue_id_single_repo
      • test_get_multiple_issue_ids_single_repo
      • test_get_multiple_issue_ids_single_repo_with_filtering
Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that the `fetch_issue_ids` method is covered by tests in other parts of the codebase.
rg --files-with-matches --type py "fetch_issue_ids" | xargs rg "test_"

Length of output: 839


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Review Status

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 43ee9f1 and 3278cd5.
Files selected for processing (5)
  • dags/hivemind_etl_helpers/github_etl.py (2 hunks)
  • dags/hivemind_etl_helpers/src/db/github/extract/init.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/github/extract/issues.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_issues.py (4 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_issues.py (2 hunks)
Additional comments not posted (4)
dags/hivemind_etl_helpers/src/db/github/extract/__init__.py (1)

8-9: Consider adding a comment explaining the purpose or future plans for the GithubExtraction class to improve code clarity.

dags/hivemind_etl_helpers/github_etl.py (1)

64-70: Ensure that the GithubExtraction instance is correctly integrated and utilized across the entire ETL process.

dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_issues.py (1)

18-20: Consider adding a comment explaining the necessity of directly accessing the private method __fetch_raw_issues in tests. Evaluate if there's a way to refactor the tests or the class design to avoid breaking encapsulation.

dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_issues.py (1)

18-20: LGTM! The test cases correctly utilize the GithubExtraction class for fetching GitHub issues.

dags/hivemind_etl_helpers/src/db/github/extract/issues.py Outdated Show resolved Hide resolved
@amindadgar amindadgar self-requested a review April 8, 2024 05:33
@@ -2,10 +2,10 @@
from datetime import datetime

from dotenv import load_dotenv
from hivemind_etl_helpers.src.db.github.extract import (
from hivemind_etl_helpers.src.db.github.extract import ( # fetch_issues,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the comment for fetch_issues

from_date : datetime | None
get the issues form a specific date that they were created
defualt is `None`, meaning to apply no filtering on data
def __fetch_raw_issues(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use just one underline for the method name.

self.assertEqual(issues, [])

def test_get_empty_results(self):
repository_ids = [123, 124]
issues = fetch_raw_issues(
issues = self.extractor._GithubIssueExtraction__fetch_raw_issues(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider updating to use the _fetch_raw_issues directly and not via GitHubIssueExtraction

@@ -60,7 +63,7 @@ def test_get_single_issue_single_repo(self):
)

repository_ids = [123]
issues = fetch_raw_issues(
issues = self.extractor._GithubIssueExtraction__fetch_raw_issues(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider updating to use the _fetch_raw_issues directly and not via GitHubIssueExtraction

Copy link
Member

@amindadgar amindadgar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your job. Please consider doing updates based on given comments.

@amindadgar
Copy link
Member

Hey @JacobCWBillings, please have a look over this note of one of the other bounty hunters shared. https://github.com/TogetherCrew/airflow-dags/blob/1e7083ea9f2a2c28bfecc6d3a1f74c4cab742b7b/dags/hivemind_etl_helpers/src/db/github/extract/github_extraction.py#L6C1-L20C43

Can you update your PR using composition rather than inheritance?

@JacobCWBillings
Copy link
Author

JacobCWBillings commented Apr 12, 2024 via email

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Comment on lines +8 to +137
"""

query = """MATCH (i:Issue)<-[:CREATED]-(user:GitHubUser)
WHERE
i.repository_id IN $repoIds
"""
if from_date is not None:
query += "AND datetime(i.updated_at) >= datetime($from_date)"

query += """
MATCH (repo:Repository {id: i.repository_id})
RETURN
user.login as author_name,
i.id as id,
i.title as title,
i.body as text,
i.state as state,
i.state_reason as state_reason,
i.created_at as created_at,
i.updated_at as updated_at,
i.closed_at as closed_at,
i.latestSavedAt as latest_saved_at,
i.html_url as url,
i.repository_id as repository_id,
repo.full_name as repository_name
ORDER BY datetime(created_at)
"""

def _exec_query(tx, repoIds, from_date):
result = tx.run(query, repoIds=repoIds, from_date=from_date)
return list(result)

with self.neo4j_driver.session() as session:
raw_records = session.execute_read(
_exec_query,
repoIds=repository_id,
from_date=from_date,
)

return raw_records

def fetch_issues(
self,
repository_id: list[int],
from_date: datetime | None = None,
) -> list[GitHubIssue]:
"""
fetch issues from data dump in neo4j

Parameters
------------
repository_id : list[int]
a list of repository id to fetch their issues
from_date : datetime | None
get the issues form a specific date that they were created
default is `None`, meaning to apply no filtering on data

Returns
--------
github_issues : list[GitHubIssue]
list of neo4j records as the extracted issues
"""
records = self._fetch_raw_issues(repository_id, from_date)

github_issues: list[GitHubIssue] = []
for record in records:
issue = GitHubIssue.from_dict(record)
github_issues.append(issue)

return github_issues

def fetch_issue_ids(
self,
repository_id: list[int],
from_date: datetime | None = None,
) -> list[GitHubIssueID]:
"""
fetch issues from data dump in neo4j

Parameters
------------
repository_id : list[int]
a list of repository id to fetch their issues
from_date : datetime | None
get the issues form a specific date that they were created
default is `None`, meaning to apply no filtering on data

Returns
--------
github_issues_ids : list[GitHubIssueID]
list of neo4j records as the extracted issue ids
"""
records = self._fetch_raw_issues(repository_id, from_date)

github_issue_ids: list[GitHubIssueID] = []
for record in records:
issue = GitHubIssueID.from_dict(record)
github_issue_ids.append(issue)

return github_issue_ids
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tip

Codebase Verification

The methods _fetch_raw_issues, fetch_issues, and fetch_issue_ids from the GithubIssueExtraction class are actively used within the codebase. However, the method _fetch_raw_issues is accessed from outside its own class (GithubIssueExtraction), specifically in github_extraction.py. This is not typical for a method prefixed with an underscore, which conventionally indicates it should be a private method. This might suggest a need to reconsider the method's visibility (i.e., renaming it to remove the underscore if it's intended to be used externally).

  • issues.py and github_extraction.py both use _fetch_raw_issues.
  • fetch_issues and fetch_issue_ids are used within their own class and also in github_extraction.py.
Analysis chain

The GithubIssueExtraction class is well-implemented, encapsulating the logic for fetching issues and issue IDs. Ensure that the methods _fetch_raw_issues, fetch_issues, and fetch_issue_ids are used correctly and tested thoroughly.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the usage of `_fetch_raw_issues`, `fetch_issues`, and `fetch_issue_ids` methods in the codebase.
rg --files-with-matches --type py "GithubIssueExtraction" | xargs rg "_fetch_raw_issues"
rg --files-with-matches --type py "GithubIssueExtraction" | xargs rg "fetch_issues"
rg --files-with-matches --type py "GithubIssueExtraction" | xargs rg "fetch_issue_ids"

Length of output: 1470

@@ -1,5 +1,6 @@
# flake8: noqa
from .comments import fetch_comments
from .commit import fetch_commits
from .issues import fetch_issues
from .github_extraction import GithubExtraction
from .issues import GithubIssueExtraction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider removing the line 5 (from .issues import GithubIssueExtraction) because this is not always required to be used and we're always using GitHubExtraction

If we wanted to use the GitHubIssueExtraction we can just import from its file and not this module directory.

# self.comment_extraction = GitHubCommentExtraction()
self.issue_extraction = GithubIssueExtraction()

def _fetch_raw_issues(
Copy link
Member

@amindadgar amindadgar Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there any specific reason we included the _fetch_raw_issues within GitHubExtraction?

I think having the fetch_issues would be enough in our GitHubExtraction

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the _raw methods are still called from GitHubExtraction inside the test cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes I see your point. To access that in test cases you can call the _fetch_raw_issues directly from GitHubExtraction, you can do the following within test cases.

self.extractor.issue_extraction._fetch_raw_issues

Copy link
Member

@amindadgar amindadgar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your updates. I added some comments, please have a look @JacobCWBillings

github_issues_ids : list[GitHubIssueID]
list of neo4j records as the extracted issue ids
"""
records = self._fetch_raw_issues(repository_id, from_date)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you write another function for just fetching the ids? Fetching the whole data and just using their ids is inefficient. You can name the method as _fetch_raw_issue_ids Please let me know if you need any help on the neo4j query.

@JacobCWBillings
Copy link
Author

JacobCWBillings commented Apr 25, 2024 via email

@JacobCWBillings
Copy link
Author

Implemented optimization suggestions for fetch_raw_issue_ids.

There seems to be an issue, now, with <dags.hivemind_etl_helpers.tests.integration.test_discourse_summary>.

Is this a known problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Hivemind] Refactor Github issue extraction functions to object orientated code
2 participants