-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
98 hivemind refactor GitHub issue extraction functions to object orientated code #107
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThe update shifts the GitHub ETL process towards object-oriented design, introducing new classes for extracting GitHub data like issues and commits. This restructuring enhances code organization and maintainability, with functions now encapsulated within specific classes for better structure and testability. Changes
Possibly related issues
Recent Review DetailsConfiguration used: CodeRabbit UI Files selected for processing (8)
Additional comments not posted (7)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Review Status
Configuration used: CodeRabbit UI
Files selected for processing (5)
- dags/hivemind_etl_helpers/github_etl.py (2 hunks)
- dags/hivemind_etl_helpers/src/db/github/extract/init.py (1 hunks)
- dags/hivemind_etl_helpers/src/db/github/extract/issues.py (1 hunks)
- dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_issues.py (4 hunks)
- dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_issues.py (2 hunks)
Additional comments not posted (4)
dags/hivemind_etl_helpers/src/db/github/extract/__init__.py (1)
8-9
: Consider adding a comment explaining the purpose or future plans for theGithubExtraction
class to improve code clarity.dags/hivemind_etl_helpers/github_etl.py (1)
64-70
: Ensure that theGithubExtraction
instance is correctly integrated and utilized across the entire ETL process.dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_issues.py (1)
18-20
: Consider adding a comment explaining the necessity of directly accessing the private method__fetch_raw_issues
in tests. Evaluate if there's a way to refactor the tests or the class design to avoid breaking encapsulation.dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_issues.py (1)
18-20
: LGTM! The test cases correctly utilize theGithubExtraction
class for fetching GitHub issues.
@@ -2,10 +2,10 @@ | |||
from datetime import datetime | |||
|
|||
from dotenv import load_dotenv | |||
from hivemind_etl_helpers.src.db.github.extract import ( | |||
from hivemind_etl_helpers.src.db.github.extract import ( # fetch_issues, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the comment for fetch_issues
from_date : datetime | None | ||
get the issues form a specific date that they were created | ||
defualt is `None`, meaning to apply no filtering on data | ||
def __fetch_raw_issues( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use just one underline for the method name.
self.assertEqual(issues, []) | ||
|
||
def test_get_empty_results(self): | ||
repository_ids = [123, 124] | ||
issues = fetch_raw_issues( | ||
issues = self.extractor._GithubIssueExtraction__fetch_raw_issues( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider updating to use the _fetch_raw_issues
directly and not via GitHubIssueExtraction
@@ -60,7 +63,7 @@ def test_get_single_issue_single_repo(self): | |||
) | |||
|
|||
repository_ids = [123] | |||
issues = fetch_raw_issues( | |||
issues = self.extractor._GithubIssueExtraction__fetch_raw_issues( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider updating to use the _fetch_raw_issues
directly and not via GitHubIssueExtraction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your job. Please consider doing updates based on given comments.
Hey @JacobCWBillings, please have a look over this note of one of the other bounty hunters shared. https://github.com/TogetherCrew/airflow-dags/blob/1e7083ea9f2a2c28bfecc6d3a1f74c4cab742b7b/dags/hivemind_etl_helpers/src/db/github/extract/github_extraction.py#L6C1-L20C43 Can you update your PR using composition rather than inheritance? |
Ciao Mohammad,
Thank you for the involvement in the project. I has been nice seeing how
many ideas have been contributed to this project, so far.
I have been traveling this past week. But today, I will have time to
implement the edits you indicate, as well as to push a solution to the
second bounty.
All the best,
-Jacob
…On Wed, Apr 10, 2024, 16:42 Mohammad Amin Dadgar ***@***.***> wrote:
Hey @JacobCWBillings <https://github.com/JacobCWBillings>, please have a
look over this note of one of the other bounty hunters shared.
https://github.com/TogetherCrew/airflow-dags/blob/1e7083ea9f2a2c28bfecc6d3a1f74c4cab742b7b/dags/hivemind_etl_helpers/src/db/github/extract/github_extraction.py#L6C1-L20C43
Can you update your PR using composition rather than inheritance?
—
Reply to this email directly, view it on GitHub
<#107 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AI2QN2OFBVC2LOXHHEUTEDLY4VFVZAVCNFSM6AAAAABFYEQZ36VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBXG42DQMBYHE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Bounties/jcwb 00
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
""" | ||
|
||
query = """MATCH (i:Issue)<-[:CREATED]-(user:GitHubUser) | ||
WHERE | ||
i.repository_id IN $repoIds | ||
""" | ||
if from_date is not None: | ||
query += "AND datetime(i.updated_at) >= datetime($from_date)" | ||
|
||
query += """ | ||
MATCH (repo:Repository {id: i.repository_id}) | ||
RETURN | ||
user.login as author_name, | ||
i.id as id, | ||
i.title as title, | ||
i.body as text, | ||
i.state as state, | ||
i.state_reason as state_reason, | ||
i.created_at as created_at, | ||
i.updated_at as updated_at, | ||
i.closed_at as closed_at, | ||
i.latestSavedAt as latest_saved_at, | ||
i.html_url as url, | ||
i.repository_id as repository_id, | ||
repo.full_name as repository_name | ||
ORDER BY datetime(created_at) | ||
""" | ||
|
||
def _exec_query(tx, repoIds, from_date): | ||
result = tx.run(query, repoIds=repoIds, from_date=from_date) | ||
return list(result) | ||
|
||
with self.neo4j_driver.session() as session: | ||
raw_records = session.execute_read( | ||
_exec_query, | ||
repoIds=repository_id, | ||
from_date=from_date, | ||
) | ||
|
||
return raw_records | ||
|
||
def fetch_issues( | ||
self, | ||
repository_id: list[int], | ||
from_date: datetime | None = None, | ||
) -> list[GitHubIssue]: | ||
""" | ||
fetch issues from data dump in neo4j | ||
|
||
Parameters | ||
------------ | ||
repository_id : list[int] | ||
a list of repository id to fetch their issues | ||
from_date : datetime | None | ||
get the issues form a specific date that they were created | ||
default is `None`, meaning to apply no filtering on data | ||
|
||
Returns | ||
-------- | ||
github_issues : list[GitHubIssue] | ||
list of neo4j records as the extracted issues | ||
""" | ||
records = self._fetch_raw_issues(repository_id, from_date) | ||
|
||
github_issues: list[GitHubIssue] = [] | ||
for record in records: | ||
issue = GitHubIssue.from_dict(record) | ||
github_issues.append(issue) | ||
|
||
return github_issues | ||
|
||
def fetch_issue_ids( | ||
self, | ||
repository_id: list[int], | ||
from_date: datetime | None = None, | ||
) -> list[GitHubIssueID]: | ||
""" | ||
fetch issues from data dump in neo4j | ||
|
||
Parameters | ||
------------ | ||
repository_id : list[int] | ||
a list of repository id to fetch their issues | ||
from_date : datetime | None | ||
get the issues form a specific date that they were created | ||
default is `None`, meaning to apply no filtering on data | ||
|
||
Returns | ||
-------- | ||
github_issues_ids : list[GitHubIssueID] | ||
list of neo4j records as the extracted issue ids | ||
""" | ||
records = self._fetch_raw_issues(repository_id, from_date) | ||
|
||
github_issue_ids: list[GitHubIssueID] = [] | ||
for record in records: | ||
issue = GitHubIssueID.from_dict(record) | ||
github_issue_ids.append(issue) | ||
|
||
return github_issue_ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tip
Codebase Verification
The methods _fetch_raw_issues
, fetch_issues
, and fetch_issue_ids
from the GithubIssueExtraction
class are actively used within the codebase. However, the method _fetch_raw_issues
is accessed from outside its own class (GithubIssueExtraction
), specifically in github_extraction.py
. This is not typical for a method prefixed with an underscore, which conventionally indicates it should be a private method. This might suggest a need to reconsider the method's visibility (i.e., renaming it to remove the underscore if it's intended to be used externally).
issues.py
andgithub_extraction.py
both use_fetch_raw_issues
.fetch_issues
andfetch_issue_ids
are used within their own class and also ingithub_extraction.py
.
Analysis chain
The GithubIssueExtraction
class is well-implemented, encapsulating the logic for fetching issues and issue IDs. Ensure that the methods _fetch_raw_issues
, fetch_issues
, and fetch_issue_ids
are used correctly and tested thoroughly.
Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the usage of `_fetch_raw_issues`, `fetch_issues`, and `fetch_issue_ids` methods in the codebase.
rg --files-with-matches --type py "GithubIssueExtraction" | xargs rg "_fetch_raw_issues"
rg --files-with-matches --type py "GithubIssueExtraction" | xargs rg "fetch_issues"
rg --files-with-matches --type py "GithubIssueExtraction" | xargs rg "fetch_issue_ids"
Length of output: 1470
@@ -1,5 +1,6 @@ | |||
# flake8: noqa | |||
from .comments import fetch_comments | |||
from .commit import fetch_commits | |||
from .issues import fetch_issues | |||
from .github_extraction import GithubExtraction | |||
from .issues import GithubIssueExtraction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider removing the line 5 (from .issues import GithubIssueExtraction
) because this is not always required to be used and we're always using GitHubExtraction
If we wanted to use the GitHubIssueExtraction
we can just import from its file and not this module directory.
# self.comment_extraction = GitHubCommentExtraction() | ||
self.issue_extraction = GithubIssueExtraction() | ||
|
||
def _fetch_raw_issues( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there any specific reason we included the _fetch_raw_issues
within GitHubExtraction?
I think having the fetch_issues would be enough in our GitHubExtraction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the _raw methods are still called from GitHubExtraction inside the test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes I see your point. To access that in test cases you can call the _fetch_raw_issues
directly from GitHubExtraction
, you can do the following within test cases.
self.extractor.issue_extraction._fetch_raw_issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your updates. I added some comments, please have a look @JacobCWBillings
github_issues_ids : list[GitHubIssueID] | ||
list of neo4j records as the extracted issue ids | ||
""" | ||
records = self._fetch_raw_issues(repository_id, from_date) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you write another function for just fetching the ids? Fetching the whole data and just using their ids is inefficient. You can name the method as _fetch_raw_issue_ids
Please let me know if you need any help on the neo4j query.
That sounds good. Will implement this functionality.
…On Thu, Apr 25, 2024, 07:18 Mohammad Amin Dadgar ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In dags/hivemind_etl_helpers/src/db/github/extract/issues.py
<#107 (comment)>
:
> + fetch issues from data dump in neo4j
+
+ Parameters
+ ------------
+ repository_id : list[int]
+ a list of repository id to fetch their issues
+ from_date : datetime | None
+ get the issues form a specific date that they were created
+ default is `None`, meaning to apply no filtering on data
+
+ Returns
+ --------
+ github_issues_ids : list[GitHubIssueID]
+ list of neo4j records as the extracted issue ids
+ """
+ records = self._fetch_raw_issues(repository_id, from_date)
Could you write another function for just fetching the ids? Fetching the
whole data and just using their ids is inefficient. You can name the method
as _fetch_raw_issue_ids Please let me know if you need any help on the
neo4j query.
—
Reply to this email directly, view it on GitHub
<#107 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AI2QN2MKT2UIWJXG3O32ANDY7CG3TAVCNFSM6AAAAABFYEQZ36VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDAMRRGUYTEOBYGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Implemented optimization suggestions for fetch_raw_issue_ids. There seems to be an issue, now, with <dags.hivemind_etl_helpers.tests.integration.test_discourse_summary>. Is this a known problem? |
Summary by CodeRabbit
GithubExtraction
andGithubIssueExtraction
classes.GitHubIssueID
for handling specific attributes related to GitHub issues.