Feat: Hivemind include url for documents #330
Conversation
Walkthrough

The changes in this pull request focus on enhancing metadata handling in the Discord and GitHub data transformation processes.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (3)
dags/hivemind_etl_helpers/src/db/github/schema/commit.py (1)
Line range hint 11-61: Consider implementing a migration strategy

The rename from `html_url` to `url` is a breaking change that affects multiple layers:

- Data storage/retrieval
- Object instantiation
- Serialization/deserialization
- External API contracts

Consider implementing one of these migration strategies:

- Temporary support for both field names during transition
- Version the API/schema
- Database migration scripts
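As a sketch of the first option (temporary support for both field names), a `from_dict` can accept either key during the transition. The `CommitRecord` class and its field set below are illustrative stand-ins, not the project's actual `GitHubCommit` schema:

```python
from dataclasses import dataclass


@dataclass
class CommitRecord:
    """Illustrative stand-in for the real commit schema class."""

    message: str
    url: str

    @classmethod
    def from_dict(cls, data: dict) -> "CommitRecord":
        # Prefer the new key; fall back to the legacy one during migration.
        url = data.get("url", data.get("html_url"))
        if url is None:
            raise KeyError("expected 'url' (or legacy 'html_url') in commit data")
        return cls(message=data["message"], url=url)
```

Once all producers emit `url`, the fallback branch (and eventually the shim) can be deleted.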
dags/hivemind_etl_helpers/tests/integration/test_github_transformation.py (1)
Let me analyze the results to verify the URL field naming consistency. The search results show:

- In schema files: `schema/comment.py`, `schema/issue.py`, `schema/commit.py`, and `schema/pull_request.py` consistently use the `url` field; `schema/commit.py` uses both `url` and `api_url` fields; `schema/pull_request.py` uses both `url` and `issue_url` fields.
- In extract SQL files: `extract/comment.py`, `extract/commit.py`, `extract/issues.py`, and `extract/pull_requests.py` consistently map `html_url` to `url` in their SQL queries.
- In test files: most test files still use `html_url` in their test data setup; the transformation test (`test_github_transformation.py`) correctly uses the new `url` field.

URL field naming has been consistently updated in the schema and extraction layers, but test data setup still uses the old naming. Test files need to be updated to use `url` instead of `html_url` for consistency:

- `test_github_etl_fetch_comments.py`
- `test_github_etl_fetch_raw_commits.py`
- `test_github_etl_fetch_raw_pull_requests.py`
- `test_github_etl_fetch_raw_issues.py`
- `test_github_etl_fetch_pull_requests.py`
- `test_github_etl_fetch_comments_filtered.py`
- `test_github_etl_fetch_issues.py`
🔗 Analysis chain
Line range hint 107-113: Verify URL field naming consistency across the codebase.

The metadata assertion correctly uses the new `url` field. Let's verify that this naming convention is consistently applied across all GitHub-related files.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Check for any remaining instances of html_url in GitHub-related files
# and ensure consistent usage of the new url field.

# Search for any remaining html_url occurrences in GitHub-related files
echo "Checking for remaining html_url occurrences:"
rg "html_url" --type py

# Search for the new url field to verify consistent usage
echo -e "\nVerifying consistent url field usage:"
rg "url.*=|url['\"]:" --type py
```

Length of output: 95693
dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_commits.py (1)
Line range hint 25-35: Update Neo4j test data setup to use consistent URL field names

The test assertions have been updated to use `url`, but the Neo4j queries still set `parents.0.html_url`. This inconsistency should be addressed. Update the Neo4j setup queries in all test methods. For example:

```diff
 CREATE (co:GitHubCommit)
     SET
         co.`commit.author.name` = "Author#1",
         co.`commit.message` = "Issue #1 is resolved!",
         co.`commit.url` = "https://api.sample_url_for_commit.html",
-        co.`parents.0.html_url` = "https://sample_url_for_commit.html",
+        co.`parents.0.url` = "https://sample_url_for_commit.html",
         co.repository_id = 123,
```

This change should be applied to all test methods to maintain consistency with the new field naming.
Also applies to: 78-88, 127-137, 171-181, 252-262
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (8)

- `dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py` (4 hunks)
- `dags/hivemind_etl_helpers/src/db/github/extract/commit.py` (1 hunks)
- `dags/hivemind_etl_helpers/src/db/github/schema/commit.py` (4 hunks)
- `dags/hivemind_etl_helpers/src/db/github/transform/commits.py` (1 hunks)
- `dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py` (4 hunks)
- `dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_commits.py` (5 hunks)
- `dags/hivemind_etl_helpers/tests/integration/test_github_transformation.py` (2 hunks)
- `dags/hivemind_etl_helpers/tests/unit/test_github_transform_commits.py` (8 hunks)
🔇 Additional comments (20)
dags/hivemind_etl_helpers/src/db/github/transform/commits.py (1)
30-30: LGTM! Verify consistent URL field naming across the codebase.

The change from `html_url` to `url` in the excluded metadata keys aligns with the broader refactoring effort. The implementation looks correct. Let's verify the consistency of this rename across the codebase:
✅ Verification successful
The URL field rename is consistently implemented across the codebase. The verification shows that:

- The change from `html_url` to `url` is consistently implemented in the extraction layer, where `html_url` is aliased as `url` in all SQL queries.
- All transform modules correctly use `url` in their metadata fields.
- The excluded metadata keys in transform modules consistently use `url` (not `html_url`).
- The only remaining `html_url` references are in:
  - Test files (which is expected for test data)
  - Extract layer SQL queries where it is properly aliased as `url`
  - One utility function in `github_api_helpers` that parses the URL to determine comment type

The change is verified to be consistent across the codebase and properly implemented.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Check for any remaining instances of html_url to ensure complete migration
# and verify the consistent use of the new 'url' field.
echo "Checking for any remaining html_url references:"
rg "html_url" -l
echo "\nVerifying usage of new 'url' field in GitHub-related files:"
rg "url" --type python -g "*github*.py" -B 2 -A 2
```

Length of output: 1426
Script:

```shell
#!/bin/bash
# Let's examine the content of files that still contain html_url references
# to determine if they are test data or actual implementation

# Check content of extract files containing html_url
echo "=== Checking extract implementation files ==="
for file in dags/hivemind_etl_helpers/src/db/github/extract/*.py; do
  echo "=== $file ==="
  rg "html_url" -C 3 "$file"
done

# Check transform files for url usage
echo -e "\n=== Checking transform files for url usage ==="
fd "transform.*\.py$" --full-path --exec rg "url" -C 3 {}

# Check if html_url appears in any non-test Python files outside extract
echo -e "\n=== Checking other implementation files for html_url ==="
rg "html_url" --type py -g "!*test*" -g "!*/extract/*" -C 3
```

Length of output: 30686
dags/hivemind_etl_helpers/src/db/github/schema/commit.py (4)
11-11: Verify all constructor calls are updated

The parameter rename from `html_url` to `url` is a breaking change that requires updates to all constructor calls.

✅ Verification successful

Parameter rename from `html_url` to `url` is properly updated. All direct constructor calls to `GitHubCommit` in the test files are already using the new `url` parameter name. The `html_url` occurrences found in the codebase are related to other entities (comments, pull requests, issues) and not to the `GitHubCommit` class.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Search for direct constructor calls to GitHubCommit
rg "GitHubCommit\(" -A 10 -B 2

# Search for dictionary construction that might be used with from_dict
rg "html_url.*?['\"].*?['\"]" --type py
```

Length of output: 15978
61-61: Verify data consumer compatibility

The change from `html_url` to `url` in the serialized output requires updates to all consumers of this data.

29-29: Verify all attribute accesses are updated

The instance variable rename from `html_url` to `url` requires updates to all code accessing this attribute.

✅ Verification successful

Let me analyze the search results to verify the attribute rename impact.

No impact from attribute rename - changes are consistent

The attribute rename from `html_url` to `url` in the `GitHubCommit` class is safe because:

- The class definition shows this is a schema class where the attribute is properly renamed everywhere (constructor, `from_dict`, and `to_dict` methods).
- All occurrences of `html_url` in the codebase are in different contexts:
  - SQL queries in extract modules where `html_url` is aliased to `url`
  - Test data setup using different classes (issues, pull requests, comments)
  - API response handling in a different module (`github_api_helpers`)

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Search for direct attribute access
rg "\.html_url" --type py

# Search for string literals that might be used in attribute access
rg "'html_url'|\"html_url\"" --type py
```

Length of output: 8546

Script:

```shell
#!/bin/bash
# Let's check the schema file to understand the context better
cat dags/hivemind_etl_helpers/src/db/github/schema/commit.py

# Also check if there are any direct attribute accesses in Python code
ast-grep --pattern 'html_url'
```

Length of output: 2938
45-45: Verify data source compatibility

The change from `html_url` to `url` in data extraction requires that all data sources (APIs, databases, etc.) provide the correct field name.

✅ Verification successful

Based on the search results, I can now provide a final response. The test files and implementation show that the GitHub API response structure consistently uses the `url` field, and the codebase is already aligned with this change. The test data and actual implementation both use the `url` field for commit URLs.

Change from `html_url` to `url` is correct

The codebase consistently uses the `url` field in both test data and implementation:

- The `GitHubCommit` class expects and processes the `url` field.
- Test fixtures and integration tests use the `url` field for commit data.
- The Neo4j query extracts the URL by aliasing `parents.0.html_url` as `url` in the response.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Search for Neo4j queries or API calls that might return this field
rg -B 5 -A 5 "RETURN.*html_url|html_url.*FROM" --type py
rg -B 5 -A 5 "response\[.*html_url" --type py
```

Length of output: 106

Script:

```shell
#!/bin/bash
# Let's check the GitHub API response structure and usage patterns
rg -B 5 -A 5 '"url".*github' --type py
rg -B 5 -A 5 'github.*"url"' --type py

# Let's also check the commit schema and related files
fd commit.py --exec cat {}
```

Length of output: 70190
dags/hivemind_etl_helpers/src/db/github/extract/commit.py (1)
81-81: Verify Neo4j schema compatibility for the URL field rename.

The change from `co.parents.0.html_url` to `co.parents.0.url` aligns with the standardization of URL field naming across the codebase. However, we should verify that this field exists in the Neo4j schema to prevent runtime issues. Let's verify the field existence in the Neo4j schema:
dags/hivemind_etl_helpers/tests/unit/test_github_transform_commits.py (6)
45-46: LGTM! Metadata structure is consistent.

The metadata assertions correctly validate both URL fields with proper GitHub URL patterns.

64-65: LGTM! URL fields properly structured with PR relation.

The URL fields are correctly implemented alongside the PR relationship data.

78-79: LGTM! Consistent URL structure maintained.

The URL fields maintain consistency with the established pattern.

91-92: LGTM! Good test coverage with repository variants.

The URL fields remain consistent while testing different repository scenarios, providing good coverage of various cases.

117-118: LGTM! Comprehensive test coverage for URL field assertions.

The metadata assertions thoroughly validate URL fields across different scenarios:

- Commits with and without related PRs
- Different repository IDs and names
- Various verification states

Also applies to: 135-136, 153-154

23-24: LGTM! Verify URL field usage across the codebase.

The URL field renaming is correctly implemented in the test data. The URL format follows GitHub's standard pattern. Let's verify the consistency of this change across the codebase:
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py (4)
171-172: LGTM: URL construction for regular messages

The URL construction follows Discord's URL format correctly, using the appropriate channel ID for regular messages.

184-185: LGTM: URL construction for messages with mentions

The URL construction is correct for messages containing user mentions, maintaining consistency with Discord's URL format.

209-210: LGTM: Clear distinction between Discord URL and referenced URLs

The metadata correctly maintains both the Discord message URL (`url`) and the referenced external URL (`url_reference`), serving different purposes.

197-199: Verify the URL construction for threaded messages

The URL is constructed using `threadId` (88888) instead of `channelId` (1313132) for threaded messages. While this might be intentional, we should verify this behavior. Let's verify Discord's URL format for threaded messages:

✅ Verification successful

URL construction for threaded messages is correct

The code in `transform_discord_raw_messges.py` explicitly shows that Discord URLs are constructed differently based on whether the message is in a thread or not:

- For regular messages: `discord.com/channels/{guild_id}/{channel_id}/{message_id}`
- For threaded messages: `discord.com/channels/{guild_id}/{thread_id}/{message_id}`

The test case correctly uses the thread ID (88888) instead of the channel ID for a threaded message, matching the implementation logic.
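The branching rule verified above can be sketched as a small helper. The function name and signature are illustrative, not the actual code in `transform_discord_raw_messges.py`:

```python
from typing import Optional


def discord_message_url(
    guild_id: str,
    channel_id: str,
    message_id: str,
    thread_id: Optional[str] = None,
) -> str:
    """Build a Discord jump URL; threads substitute the thread ID for the channel ID."""
    location = thread_id if thread_id is not None else channel_id
    return f"https://discord.com/channels/{guild_id}/{location}/{message_id}"
```

With the test's IDs, a threaded message yields `.../channels/{guild_id}/88888/{message_id}` while a regular message in the same channel yields `.../channels/{guild_id}/1313132/{message_id}`.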
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Search for other instances of URL construction for threaded messages
rg -l 'channels/.*/.*/.*' | xargs rg 'threadId.*messageId'

# Search for tests specifically handling thread URLs
fd -e py | xargs rg 'thread.*url.*discord\.com/channels'
```

Length of output: 119

Script:

```shell
#!/bin/bash
# Let's try a broader search for Discord URL construction patterns
rg -A 3 "discord\.com/channels"

# Also search for any URL-related tests or constants
rg -A 3 "url.*discord\.com"

# Look for thread-related URL handling
ast-grep --pattern 'url = $_ + "/channels/"'
```

Length of output: 5012
dags/hivemind_etl_helpers/tests/integration/test_github_transformation.py (1)
86-89: LGTM! Clear distinction between URL types.

The change from `html_url` to `url` maintains a clear separation between the web URL and API URL fields.
dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (3)
98-101: LGTM: Clean extraction of message identifiers

The extraction of message identifiers is well-organized and follows consistent naming conventions.

177-177: LGTM: Consistent metadata field addition

The URL field is appropriately added to the metadata dictionary, maintaining consistency with the codebase-wide change from `html_url` to `url`.

247-247: Verify the decision to exclude URLs from metadata

The URL is being excluded from both embed and LLM metadata. Consider whether this is the intended behavior:

- URLs might be useful for traceability in embeddings
- They could provide context for LLM processing

Let's check if this aligns with how URLs are handled in other document types:

Also applies to: 264-264

✅ Verification successful

URL exclusion from metadata is consistent with codebase patterns

The exclusion of URLs from both embed and LLM metadata in Discord messages aligns with the established patterns across other document types in the codebase:

- MediaWiki documents exclude URLs from both embed and LLM metadata.
- GitHub resources (comments, issues, commits) consistently exclude URLs from LLM metadata.
- This consistent treatment suggests a deliberate architectural decision to exclude URLs from metadata processing.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Check how URL fields are handled in other document types
rg -A 5 "excluded_(embed|llm)_metadata_keys.*url" --glob '!transform_discord_raw_messges.py'
```

Length of output: 2457
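To illustrate the mechanism being verified here without depending on any particular document library, the effect of an excluded-keys list is simply that the field survives in stored metadata but is filtered from the view handed to the embedder or LLM. All names below are assumptions for illustration only:

```python
# Conceptual sketch: `url` stays in stored metadata but is filtered out of the
# view a downstream consumer (embedding model or LLM) receives.
metadata = {
    "author": "user#1234",
    "url": "https://discord.com/channels/1/2/3",
}
excluded_llm_metadata_keys = ["url"]


def visible_metadata(meta: dict, excluded: list) -> dict:
    """Return only the metadata entries a downstream consumer would see."""
    return {k: v for k, v in meta.items() if k not in excluded}


llm_view = visible_metadata(metadata, excluded_llm_metadata_keys)
```

The URL remains queryable in storage but never reaches the prompt or the embedding text, which is the trade-off the comment above weighs.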
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (2)
108-108: Add a comment explaining the messageId values.

The hardcoded messageIds make the tests deterministic, which is good. However, consider adding a comment explaining why these specific values were chosen (e.g., "Using sequential messageIds starting from 10000000000 for consistent and traceable test data").

Also applies to: 126-126, 144-144, 162-162, 182-182

274-274: Add test cases for URL edge cases.

While the URL construction looks correct, consider adding test cases for:

- Messages with invalid/malformed guild_id, channel_id, or message_id
- Messages in archived threads
- Messages in forum posts

Also applies to: 288-288, 303-303, 313-313
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)

- `dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py` (9 hunks)
- `dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_commits.py` (2 hunks)
🔇 Additional comments (2)
dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_commits.py (1)
65-65: Verify URL field renaming impact

The change from `html_url` to `url` might affect other parts of the codebase. Let's verify the consistency of this change.

Also applies to: 109-109

✅ Verification successful

The URL field change is consistent with the codebase design

The change from `html_url` to `url` in the test assertion is correct and consistent with the codebase. The Neo4j queries in `hivemind_etl_helpers/src/db/github/extract/*.py` show that `html_url` fields are consistently aliased as `url` in the extraction layer:

- `pr.html_url as url`
- `c.html_url AS url`
- `co.parents.0.html_url AS url`
- `i.html_url as url`

This indicates that the test is correctly asserting against the transformed field name, as per the ETL design.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Check for any remaining instances of html_url and potential inconsistencies

# Search for remaining html_url occurrences in Python files
echo "Searching for remaining html_url occurrences:"
rg --type python "html_url" -l

# Search for Neo4j queries that might need updating
echo -e "\nSearching for Neo4j queries with html_url:"
rg --type python "parents\.0\.html_url"

# Search for related test assertions
echo -e "\nSearching for test assertions with url/html_url:"
rg --type python "assertEqual.*html_url|assertEqual.*url"
```

Length of output: 540

Script:

```shell
#!/bin/bash
# Let's try without the --type python flag since it's not supported

# Search for remaining html_url occurrences
echo "Searching for remaining html_url occurrences:"
rg "html_url"

# Search for Neo4j queries that might need updating
echo -e "\nSearching for Neo4j queries with html_url:"
rg "parents\.0\.html_url"

# Search for test assertions with url/html_url
echo -e "\nSearching for test assertions with url/html_url:"
rg "assertEqual.*?[\"']url[\"']|assertEqual.*?[\"']html_url[\"']"

# Search for url field usage in the specific test file
echo -e "\nSearching in the test file:"
cat "dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_commits.py"
```

Length of output: 16944
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (1)
274-274: Verify URL construction implementation.

Let's verify that the URL construction logic in the implementation handles all cases correctly.

Also applies to: 288-288, 303-303, 313-313

✅ Verification successful

URL construction implementation is correctly handled

The verification shows that the URL construction is implemented correctly in `transform_discord_raw_messges.py`, with proper handling of both regular channel messages and thread messages:

- For regular messages: `https://discord.com/channels/{guild_id}/{channel_id}/{message_id}`
- For thread messages: `https://discord.com/channels/{guild_id}/{thread_id}/{message_id}`

The test files confirm this implementation with comprehensive test cases covering both scenarios, including proper handling of guild IDs, channel IDs, thread IDs, and message IDs.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Check the URL construction implementation in the codebase

# Find the file containing the URL construction logic
echo "Searching for URL construction implementation..."
rg -l "discord.com/channels" --type py

# Look for any existing URL validation or construction methods
echo "Checking for URL-related methods..."
ast-grep --pattern 'def $_(url$$$) { $$$ }' --lang py
```

Length of output: 966

Script:

```shell
#!/bin/bash
# Let's check the URL construction implementation in the found files
echo "Checking URL construction in transform_discord_raw_messges.py..."
rg -A 5 -B 5 "discord.com/channels" dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py

echo "\nChecking URL construction in test files..."
rg -A 5 -B 5 "discord.com/channels" dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py

# Let's also check for any URL-related functions with a corrected pattern
echo "\nChecking for URL-related functions..."
ast-grep -p 'def $_($$$) { $$$ discord.com/channels $$$ }' --lang py
```

Length of output: 12305
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (6)
dags/hivemind_etl_helpers/src/db/discourse/utils/transform_raw_to_documents.py (1)
47-47: Consider metadata structure optimization

The metadata dictionary includes both `forum_endpoint` and the full `link`. Since the `forum_endpoint` can be extracted from the link if needed, consider removing it to avoid redundancy:

```diff
 metadata={
     "author_name": post["author_name"],
     "author_username": post["author_username"],
-    "forum_endpoint": forum_endpoint,
     "createdAt": post["createdAt"],
     "updatedAt": post["updatedAt"],
     "postId": post["postId"],
     "topic": post["topic"],
     "category": post["category"],
     "authorTrustLevel": post["authorTrustLevel"],
     "liker_usernames": post["liker_usernames"],
     "liker_names": post["liker_names"],
     "replier_usernames": post["replier_usernames"],
     "replier_names": post["replier_names"],
     "link": link,
 },
```

Also applies to: 58-58
dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py (2)
Line range hint 8-23: Update function docstring to document new return fields.

The function's docstring should be updated to document the new `topic_id` and `post_number` fields in the return value.

```diff
 Returns
 ---------
 raw_records : list[neo4j._data.Record]
-    list of neo4j records as the result
+    list of neo4j records containing:
+    - topic_id: The ID of the associated topic
+    - post_number: The sequential number of the post within its topic
+    - ... (other existing fields)
```

Line range hint 140-160: Add missing `post_number` field to grouped results.

The `post_number` field is added to the main query but missing from the grouped results structure. This inconsistency could affect URL construction for grouped posts. Add the missing field to maintain consistency with `fetch_raw_posts`:

```diff
 RETURN
     date,
     COLLECT({
         author_username: author.username,
         author_name: author.name,
         topic: topic,
         postId: postId,
         forum_endpoint: $forum_endpoint,
+        post_number: p.postNumber,
         raw: raw,
         createdAt: createdAt,
         updatedAt: updatedAt,
```
dags/hivemind_etl_helpers/tests/integration/test_discourse_fetch_posts.py (3)
58-59: Ensure consistent floating-point format for postNumber

The `postNumber` property is set as a floating-point value (1.0). While this works, consider documenting why floating-point is used instead of integer, as post numbers are typically sequential integers.

175-176: Consider adding boundary test cases for post numbers

While the current tests cover basic scenarios, consider adding test cases for:

- Posts with very large numbers
- Posts with missing post numbers
- Posts with duplicate post numbers

This would help ensure robust handling of edge cases. Example test case to add:

```python
def test_fetch_posts_with_edge_case_numbers(self):
    neo4j_ops = Neo4jConnection().neo4j_ops
    # Setup test data with edge cases
    neo4j_ops.neo4j_driver.execute_query(
        """
        CREATE (p:DiscoursePost)
        SET p.endpoint = "wwwdwadeswdpoi123",
            p.postNumber = 999999.0  // Very large number
        """
    )
    # Add assertions for edge cases
```

Also applies to: 204-205, 241-241

Line range hint 1-254: Consider implementing test data fixtures

The test file contains a lot of duplicate test data setup code. Consider implementing test fixtures or helper methods to:

- Centralize test data creation
- Reduce code duplication
- Make test maintenance easier

Example implementation:

```python
@classmethod
def setUpClass(cls):
    cls.neo4j_ops = Neo4jConnection().neo4j_ops

def create_test_post(self, post_number, created_at, **kwargs):
    """Helper method to create test posts with default values."""
    return self.neo4j_ops.neo4j_driver.execute_query(
        """
        CREATE (p:DiscoursePost)
        SET p.endpoint = $endpoint,
            p.postNumber = $post_number,
            p.createdAt = $created_at
            // ... other default properties
        RETURN p
        """,
        endpoint="wwwdwadeswdpoi123",
        post_number=post_number,
        created_at=created_at,
        **kwargs,
    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)

- `dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py` (1 hunks)
- `dags/hivemind_etl_helpers/src/db/discourse/utils/transform_raw_to_documents.py` (2 hunks)
- `dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py` (4 hunks)
- `dags/hivemind_etl_helpers/tests/integration/test_discourse_fetch_posts.py` (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py
🔇 Additional comments (5)
dags/hivemind_etl_helpers/src/db/discourse/utils/transform_raw_to_documents.py (1)
Line range hint 36-58: Verify URL construction across the codebase

Let's verify that the URL construction pattern is consistent with other parts of the codebase.

✅ Verification successful

URL construction pattern is consistent across the codebase

The verification shows that the URL construction pattern `https://{forum_endpoint}/t/{topic_id}/{post_number}` is consistently used throughout the codebase. This is evidenced by:

- The implementation in `transform_raw_to_documents.py`
- Multiple matching test cases in `test_discourse_transform_raw_data.py` that follow the same pattern
- Consistent usage of `forum_endpoint` from the post dictionary
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Search for similar URL construction patterns in the codebase
rg -A 2 "https://.*?/t/.*?/" --type py

# Search for forum_endpoint usage
ast-grep --pattern 'post["forum_endpoint"]'
```

Length of output: 4543
dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py (1)
58-58: LGTM! Verify field availability.

The addition of `topic_id` and `post_number` fields aligns with the PR objective to include URLs for documents. The field references are correct. Let's verify that these fields are consistently available in the database:

Also applies to: 61-61

✅ Verification successful

Fields are used correctly for URL construction

Based on the evidence found, the `topic_id` and `post_number` fields are correctly used in URL construction (`transform_raw_to_documents.py`). The URL pattern follows the format `https://{forum_endpoint}/t/{topic_id}/{post_number}`, which is the standard Discourse URL structure. Since these fields are essential for URL construction and are part of the core Discourse data model, they will be consistently available in the database.
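As a sketch, the link construction being verified reduces to a one-line formatter. The helper name below is hypothetical; the real logic lives in `transform_raw_to_documents.py`:

```python
def discourse_post_link(forum_endpoint: str, topic_id: int, post_number: int) -> str:
    """Format a Discourse post URL as https://{forum_endpoint}/t/{topic_id}/{post_number}."""
    return f"https://{forum_endpoint}/t/{topic_id}/{post_number}"
```

Because the endpoint is interpolated into the scheme-prefixed template, a `forum_endpoint` that already carries `https://` would produce a malformed link, which is one reason the field-availability check above matters.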
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Verify that topic_id and postNumber are consistently present in the database
# to avoid potential null values in URL construction
ast-grep --pattern 'CREATE (p:DiscoursePost {$$$})
SET p.postNumber = $_,
    p.topicId = $_'
```

Length of output: 100

Script:

```shell
#!/bin/bash
# Let's try a different approach to verify the field usage:
# 1. Check if these fields are consistently used in queries
# 2. Look for any null checks or error handling around these fields

# Check query patterns and field usage
rg -U "MATCH.*\(t:DiscourseTopic\).*\(p:DiscoursePost\).*RETURN.*t\.id.*p\.postNumber" --type py

# Check for any null handling or validation
rg "(?i)(topic_id|post_number).*(None|null|missing|empty)" --type py -A 2

# Look for any error handling related to these fields
rg "(?i)(topic_id|post_number).*(try|except|error|raise)" --type py -A 2
```

Length of output: 9955

Script:

```shell
#!/bin/bash
# Let's check the schema and data handling in the codebase:
# 1. Look for field definitions or schema declarations
# 2. Check how these fields are used in URL construction

# Check for schema or field definitions
rg -U "class.*DiscoursePost|class.*DiscourseTopic" --type py -A 10

# Look for URL construction using these fields
rg "(?i)(url|link).*\b(topic_id|post_number)\b" --type py -A 2

# Check the actual query usage in the codebase
rg -U "MATCH.*\(p:DiscoursePost\)" --type py
```

Length of output: 2835
dags/hivemind_etl_helpers/tests/integration/test_discourse_fetch_posts.py (3)
87-88
: LGTM: Consistent with previous post number pattern
The second post correctly follows the same floating-point format (2.0) as the first post.
Line range hint 140-152
: LGTM: Comprehensive assertions for second post
The assertions for the second post properly validate all fields including the new post_number
field, maintaining consistency with the first post's assertions.
Line range hint 124-136
: Verify post_number field handling in fetch_raw_posts function
The test assertions have been updated to include the new post_number
field. Let's verify the implementation in the source function.
✅ Verification successful
Let's search for the implementation to verify the post_number field handling. The previous ast-grep pattern might not have matched due to different syntax or file type, so let's try a broader search approach.
Let's check the actual query implementation to verify how the post_number field is handled in the fetch_raw_posts function.
Let's try different search patterns to locate the Cypher query implementation in the fetch_raw_posts function.
Post number field is correctly implemented in fetch_raw_posts function
The Cypher query in the fetch_raw_posts function correctly handles the post_number field by retrieving it from the postNumber property of the DiscoursePost node:
p.postNumber as post_number
This matches the test assertion that expects the post_number field in the returned data.
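As a lightweight sanity check, the projection can be exercised outside Neo4j. This sketch is illustrative: only the p.postNumber as post_number projection comes from the review above; the MATCH clause, the HAS_POST relationship name, and the projected_fields helper are assumptions.

```python
import re

# Illustrative query shape; only the `p.postNumber AS post_number`
# projection is taken from the review -- the MATCH clause and the
# HAS_POST relationship name are assumptions for this sketch.
QUERY = """
MATCH (t:DiscourseTopic)-[:HAS_POST]->(p:DiscoursePost)
RETURN t.id AS topic_id, p.postNumber AS post_number
"""

def projected_fields(cypher: str) -> list[str]:
    # Collect the aliases introduced with `AS` in the RETURN clause.
    return re.findall(r"\bAS\s+(\w+)", cypher, flags=re.IGNORECASE)

print(projected_fields(QUERY))  # ['topic_id', 'post_number']
```

A check like assert "post_number" in projected_fields(QUERY) mirrors the test assertion without a database round-trip.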
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for post_number field handling in fetch_raw_posts function
ast-grep --pattern 'RETURN {
$$$
post_number: p.postNumber,
$$$
}'
Length of output: 74
Script:
#!/bin/bash
# Search for files containing fetch_raw_posts function
fd -e py -e cypher | xargs rg -l "fetch_raw_posts"
# After finding the files, let's examine the content
fd -e py -e cypher | xargs rg "fetch_raw_posts" -A 10 -B 10
Length of output: 28677
Script:
#!/bin/bash
# Let's examine the Cypher query in fetch_raw_posts function to verify post_number field
rg -A 50 "MATCH \(p:DiscoursePost\)" dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py
Length of output: 100
Script:
#!/bin/bash
# Let's try a simpler search pattern to find the Cypher query
rg -A 50 "MATCH.*DiscoursePost" dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py
# Also search for the query string definition
rg -B 5 -A 50 "query = " dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py
Length of output: 8459
dags/hivemind_etl_helpers/src/db/discourse/utils/transform_raw_to_documents.py
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (2)
259-291: LGTM! Well-structured test for single document transformation.
The test thoroughly verifies the document transformation logic and metadata handling.
Remove the unnecessary f-string prefix:
- "url": f"https://drive.google.com/file/d/file_1/view", + "url": "https://drive.google.com/file/d/file_1/view",🧰 Tools
🪛 Ruff (0.8.0)
288-288: f-string without any placeholders
Remove extraneous f prefix (F541)
292-373: LGTM! Comprehensive test coverage for multiple documents.
The test thoroughly verifies the transformation logic across multiple documents. However, there are a few improvements we can make:
- Remove unnecessary f-string prefixes:
- "url": f"https://drive.google.com/file/d/file_1/view", + "url": "https://drive.google.com/file/d/file_1/view", - "url": f"https://drive.google.com/file/d/file_2/view", + "url": "https://drive.google.com/file/d/file_2/view", - "url": f"https://drive.google.com/file/d/file_3/view", + "url": "https://drive.google.com/file/d/file_3/view",
- Consider refactoring the test data setup to reduce duplication:
def create_test_document(doc_id, file_id, author):
    return Document(
        doc_id=doc_id,
        text="test",
        metadata={
            "file id": f"file_{file_id}",
            "author": f"author_{author}",
            "file name": f"file_name_{file_id}",
            "mime type": "mime",
            "created at": "date",
            "modified at": "modified",
        },
    )

# In test method:
documents = [
    create_test_document(i, i, i) for i in range(1, 4)
]
🧰 Tools
🪛 Ruff (0.8.0)
345-345: f-string without any placeholders
Remove extraneous f prefix (F541)
358-358: f-string without any placeholders
Remove extraneous f prefix (F541)
371-371: f-string without any placeholders
Remove extraneous f prefix (F541)
dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (1)
122-123: Type annotation compatibility with earlier Python versions
The type annotation str | None requires Python 3.10 or later. If the project needs to support earlier Python versions, consider using Optional[str] from the typing module for compatibility.
You can update the type annotation as follows:
-from typing import List, Optional
+from typing import List, Optional

 for doc in transformed_docs:
-    file_id: str | None = doc.metadata.get("file id")
+    file_id: Optional[str] = doc.metadata.get("file id")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (2 hunks)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py
288-288: f-string without any placeholders
Remove extraneous f prefix (F541)
345-345: f-string without any placeholders
Remove extraneous f prefix (F541)
358-358: f-string without any placeholders
Remove extraneous f prefix (F541)
371-371: f-string without any placeholders
Remove extraneous f prefix (F541)
🔇 Additional comments (1)
dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (1)
53-54: Integration of document transformation step
The addition of the _transform_google_documents method to transform documents before returning them is a good enhancement. This ensures all documents include the necessary URL metadata.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (2)
119-128: LGTM with suggestions for improvement
The implementation correctly addresses the deep copy concern from the previous review and properly handles the URL construction.
Consider extracting the URL format as a class constant for better maintainability:
class GoogleDriveLoader: + GDRIVE_FILE_URL_FORMAT = "https://drive.google.com/file/d/{}/view" def __init__(self, refresh_token: str): self.refresh_token = refresh_token def _transform_google_documents(self, documents: list[Document]) -> list[Document]: transformed_docs: list[Document] = copy.deepcopy(documents) for doc in transformed_docs: file_id: str | None = doc.metadata.get("file id") if file_id is None: doc.metadata["url"] = None else: - doc.metadata["url"] = f"https://drive.google.com/file/d/{file_id}/view" + doc.metadata["url"] = self.GDRIVE_FILE_URL_FORMAT.format(file_id)
115-118: Enhance method documentation
The docstring could be more descriptive about the method's behavior and return value.
Consider updating the docstring:
 def _transform_google_documents(self, documents: list[Document]) -> list[Document]:
     """
-    transform google extracted documents by inserting their metadata a url
+    Transform Google Drive documents by adding URL to their metadata.
+
+    Args:
+        documents: List of Document objects with Google Drive metadata
+
+    Returns:
+        List of transformed Document objects with 'url' added to metadata
     """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (3 hunks)
🔇 Additional comments (2)
dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (2)
1-1: LGTM: Import added correctly
The copy module import is appropriately placed and necessary for the deep copying functionality.
53-55: LGTM: Document transformation properly integrated
The transformation step is correctly placed after all documents are loaded but before returning. This ensures consistent processing of all documents regardless of their source (folders, drives, or files).
Let's verify that all callers of load_data can handle the modified document structure:
✅ Verification successful
LGTM: Document transformation is safely integrated
The transformation is correctly implemented and the changes are safe because:
- The URL field is only added to the document metadata, not modifying any existing fields
- The transformation is properly tested as shown in the test files
- All callers of load_data receive the documents through the Document class interface and don't make assumptions about specific metadata fields
- The URL field is safely handled with a fallback to None when file_id is missing
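The fallback behavior summarized above can be sketched as a minimal stand-in for _transform_google_documents. Documents are modeled as plain dicts here instead of llama-index Document objects; only the "file id" metadata key and the URL format come from the review excerpts.

```python
import copy

def transform_google_documents(documents: list[dict]) -> list[dict]:
    """Add a Google Drive view URL to each document's metadata (sketch)."""
    # Deep copy so the caller's documents are left untouched.
    transformed = copy.deepcopy(documents)
    for doc in transformed:
        file_id = doc["metadata"].get("file id")
        if file_id is None:
            doc["metadata"]["url"] = None  # fallback when the id is missing
        else:
            doc["metadata"]["url"] = f"https://drive.google.com/file/d/{file_id}/view"
    return transformed

docs = [{"metadata": {"file id": "file_1"}}, {"metadata": {}}]
out = transform_google_documents(docs)
print(out[0]["metadata"]["url"])  # https://drive.google.com/file/d/file_1/view
print(out[1]["metadata"]["url"])  # None
```

Because of the deep copy, the input documents never gain a url key, which is the safety property the verification above relies on.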
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for all callers of load_data to ensure they handle the new URL field
rg -l "load_data\(" | xargs rg "metadata\[.?['\"]url['\"]"
Length of output: 309
Script:
#!/bin/bash
# Let's check the actual transformation logic and its usage
ast-grep --pattern 'def _transform_google_documents($$$)'
Length of output: 59
Script:
#!/bin/bash
# Let's check the implementation with ripgrep since ast-grep didn't work
rg -A 10 "_transform_google_documents"
Length of output: 4484
Script:
#!/bin/bash
# Let's check where the transformed documents are used
rg -B 5 -A 5 "load_data\(" dags/hivemind_etl_helpers/src/db/gdrive/
Length of output: 6005
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (2)
121-147: LGTM! Consider extracting test data
The expected results correctly include the new URL field with proper Google Drive URL format. Consider extracting the test data into a separate fixture or helper method to improve test maintainability.
def create_test_document(file_id: str, author: str) -> Document:
    return Document(
        id_=f"{file_id}.docx",
        metadata={
            "file_name": f"{file_id}.docx",
            "file id": file_id,
            "author": author,
            "file name": "Option",
            "url": f"https://drive.google.com/file/d/{file_id}/view",
        },
        text="Option 1: Keep it super casual",
    )
284-316: Remove unnecessary f-string prefix
The test is well-structured and validates the document transformation correctly. However, there's an unnecessary f-string prefix that should be removed.
- "url": f"https://drive.google.com/file/d/file_1/view", + "url": "https://drive.google.com/file/d/file_1/view",🧰 Tools
🪛 Ruff (0.8.0)
313-313: f-string without any placeholders
Remove extraneous f prefix (F541)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py
313-313: f-string without any placeholders
Remove extraneous f prefix (F541)
370-370: f-string without any placeholders
Remove extraneous f prefix (F541)
383-383: f-string without any placeholders
Remove extraneous f prefix (F541)
396-396: f-string without any placeholders
Remove extraneous f prefix (F541)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py
Outdated
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (1)
121-147: LGTM! Consider extracting URL construction logic
The updated test correctly validates the new URL field in document metadata. However, the URL construction pattern is duplicated across multiple test cases.
Consider extracting the URL construction into a helper method:
def _build_gdrive_url(file_id: str) -> str:
    return f"https://drive.google.com/file/d/{file_id}/view"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (2 hunks)
🔇 Additional comments (1)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (1)
284-398: 🛠️ Refactor suggestion
Refactor tests using parameterized testing
While the tests thoroughly validate the document transformation functionality, there's significant duplication in test data and assertions.
As suggested in the previous review, consider using parameterized testing to improve maintainability. The pattern of creating test documents and validating their metadata is repeated multiple times.
import pytest

@pytest.mark.parametrize("doc_id,file_id,author", [
    (1, "file_1", "author_1"),
    (2, "file_2", "author_2"),
    (3, "file_3", "author_3"),
])
def test_transform_document(self, doc_id, file_id, author):
    loader = GoogleDriveLoader(refresh_token=self.refresh_token)
    document = Document(
        doc_id=doc_id,
        text="test",
        metadata={
            "file id": file_id,
            "author": author,
            "file name": f"file_name_{doc_id}",
            "mime type": "mime",
            "created at": "date",
            "modified at": "modified",
        },
    )
    transformed_docs = loader._transform_google_documents(documents=[document])
    expected_metadata = {
        "file id": file_id,
        "author": author,
        "file name": f"file_name_{doc_id}",
        "mime type": "mime",
        "created at": "date",
        "modified at": "modified",
        "url": f"https://drive.google.com/file/d/{file_id}/view",
    }
    assert len(transformed_docs) == 1
    assert transformed_docs[0].metadata == expected_metadata
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
dags/hivemind_etl_helpers/notion_etl.py (1)
96-120: Consider enhancing URL handling robustness
While the implementation is functional, consider these improvements:
- Extract the Notion base URL as a class constant or configuration value
- Add URL encoding for the page_id to handle special characters
- Add example metadata structure in the docstring
Here's a suggested improvement:
 class NotionProcessor:
+    NOTION_BASE_URL = "https://www.notion.so"

     def __init__(
         self,
         community_id: str,
         access_token: str | None = None,
     ) -> None:
         # ... existing code ...

     def _transform_documents(self, documents: list[Document]) -> list[Document]:
         """
         transform notion extracted documents by inserting their metadata a url

         Parameters
         ------------
         documents : list[Document]
             a list of notion extracted pages

         Returns
         ---------
         documents : list[Document]
             a list of documents each inlcuded with url in its metadata

+        Example
+        --------
+        Input document metadata: {"page_id": "123abc", "title": "My Page"}
+        Output document metadata: {"page_id": "123abc", "title": "My Page", "url": "https://www.notion.so/123abc"}
         """
         transformed_docs: list[Document] = copy.deepcopy(documents)
         for doc in transformed_docs:
             page_id: str | None = doc.metadata.get("page_id")
             if page_id is None:
                 doc.metadata["url"] = None
             else:
-                doc.metadata["url"] = f"https://www.notion.so/{page_id}"
+                # URL encode the page_id to handle special characters
+                encoded_page_id = urllib.parse.quote(page_id)
+                doc.metadata["url"] = f"{self.NOTION_BASE_URL}/{encoded_page_id}"
         return transformed_docs

Don't forget to add the import:
import urllib.parse
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
dags/hivemind_etl_helpers/notion_etl.py (4 hunks)
🔇 Additional comments (4)
dags/hivemind_etl_helpers/notion_etl.py (4)
61-62: LGTM: Clean integration of document transformation
The transformation step is properly placed before pipeline execution, maintaining the existing flow while adding URL metadata.
77-78: LGTM: Consistent implementation of document transformation
The transformation step is consistently implemented across methods, ensuring uniform URL metadata addition.
93-94: LGTM: Maintains consistency in document transformation
The transformation step is properly integrated, maintaining consistency with other processing methods.
96-120: Verify URL handling consistency across document types
Let's ensure the URL handling is consistent with other document types in the system.
✅ Verification successful
URL handling is consistent with other document sources
The URL handling in the Notion ETL is consistent with other document sources in the codebase. Specifically:
- Google Drive documents follow the same pattern of URL handling in metadata, using a similar structure: doc.metadata["url"] = f"https://drive.google.com/file/d/{file_id}/view"
- Both implementations handle the case where the ID is missing by setting doc.metadata["url"] = None
- The URL construction follows a consistent pattern across document sources, where each source uses its specific base URL with the respective ID
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check URL handling patterns across different document types
# Check for URL field usage patterns
echo "Checking URL field usage patterns:"
rg -g '!*.pyc' -g '!*test*' 'metadata\[.url.|\.url\s*=' --type py
# Check for Notion URL construction patterns
echo "Checking Notion URL construction patterns:"
rg -g '!*.pyc' 'notion\.so' --type py
# Check for URL encoding usage
echo "Checking URL encoding usage:"
rg -g '!*.pyc' 'urllib\.parse\.quote' --type py
Length of output: 1278
Script:
#!/bin/bash
# Let's check the Document class definition and its usage patterns
echo "Checking Document class definition and metadata handling:"
ast-grep --pattern 'class Document'
# Let's also check for any document transformation patterns
echo "Checking document transformation patterns:"
rg -g '!*.pyc' '_transform.*document' --type py -A 5
# Check for metadata field patterns across the codebase
echo "Checking metadata field patterns:"
rg -g '!*.pyc' 'metadata\[".+"\]' --type py
Length of output: 25165
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/notion_etl.py (2)
98-108: Fix typo in docstring
There's a typo in the docstring: "inlcuded" should be "included".
- a list of documents each inlcuded with url in its metadata
+ a list of documents each included with url in its metadata
113-118: Consider enhancing URL construction robustness
The current URL construction could be improved in several ways:
- Consider validating the page_id format before URL construction
- The Notion domain is hardcoded - consider making it configurable
- URL construction might need URL-safe encoding for the page_id
Here's a suggested improvement:
+NOTION_BASE_URL = "https://www.notion.so"
+
 def _transform_documents(self, documents: list[Document]) -> list[Document]:
     transformed_docs: list[Document] = copy.deepcopy(documents)
     for doc in transformed_docs:
         page_id: str | None = doc.metadata.get("page_id")
-        if page_id is None:
+        if not page_id or not isinstance(page_id, str):
             doc.metadata["url"] = None
         else:
-            doc.metadata["url"] = f"https://www.notion.so/{page_id}"
+            from urllib.parse import quote
+            safe_page_id = quote(page_id)
+            doc.metadata["url"] = f"{self.NOTION_BASE_URL}/{safe_page_id}"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
dags/hivemind_etl_helpers/notion_etl.py (4 hunks)
🔇 Additional comments (2)
dags/hivemind_etl_helpers/notion_etl.py (2)
61-62: LGTM! Consistent transformation step added before ingestion.
The changes consistently add document transformation before pipeline ingestion across all processing methods, maintaining good code organization and clarity.
Also applies to: 77-78, 93-94
96-120: Verify URL format compatibility with dependent systems.
The new URL metadata field might be consumed by other components. Let's verify the compatibility.
✅ Verification successful
URL format is consistent with existing implementations
The codebase search reveals that the URL metadata field is used in a similar pattern in gdrive_loader.py, where it's either set to None or a valid URL. The Notion URL format follows the same pattern:
- Both implementations handle the case of missing identifiers by setting the URL to None
- Both use standard HTTPS URLs with domain-specific formats
- No other components are accessing this metadata field
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any code that might be accessing the URL field in document metadata
rg -g '!*.pyc' -g '!*test*' 'metadata\[.url.\]|metadata\.get\(.url.\)'
# Search for any tests that might need updating for the new URL field
rg -g '*test*.py' 'metadata\[.url.\]|metadata\.get\(.url.\)'
Length of output: 576
Summary by CodeRabbit
New Features
Bug Fixes
Tests