
Feat: Hivemind include url for documents #330

Merged: 15 commits into main, Nov 28, 2024

Conversation

@amindadgar (Member) commented Nov 27, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced message transformation process to include additional metadata and improved URL generation for Discord messages.
    • Updated GitHub commit data handling to consistently use the new URL format.
    • Introduced URL construction for Telegram messages, improving metadata representation.
    • Added topic ID and post number to discourse posts, enhancing the data returned from fetch functions.
    • Implemented transformation for Google Drive documents to include URLs in their metadata.
    • Added URL metadata transformation for Notion documents during processing.
  • Bug Fixes

    • Resolved inconsistencies in URL field handling across various components.
  • Tests

    • Updated test cases to validate the new URL structure for both Discord messages and GitHub commits, ensuring accuracy in metadata.
    • Enhanced tests for discourse posts to include and validate the new post number property.
    • Standardized test data for Discord message IDs, improving consistency in test results.
    • Added tests for the transformation functionality of Google Drive documents, ensuring correct metadata handling.
    • Expanded test coverage for Notion document processing to verify URL metadata inclusion.

@amindadgar linked an issue Nov 27, 2024 that may be closed by this pull request

coderabbitai bot commented Nov 27, 2024

Walkthrough

The changes in this pull request focus on enhancing metadata handling in the Discord and GitHub data transformation processes. In Discord, the transform_discord_raw_messges.py file has been updated to include new parameters for message metadata and improved URL construction. In GitHub-related files, the html_url field has been replaced with url across various classes and methods, affecting how commit data is processed and tested. These modifications ensure consistent metadata handling and improve the accuracy of URLs generated for messages and commits.

Changes

File | Change Summary
dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py | Updated transform_discord_raw_messages and prepare_document functions to handle new metadata. Added logic for constructing URLs based on threadId.
dags/hivemind_etl_helpers/src/db/github/extract/commit.py | Modified GithubCommitExtraction class to change html_url to url in the Neo4j query result.
dags/hivemind_etl_helpers/src/db/github/schema/commit.py | Changed parameter name from html_url to url in GitHubCommit class methods and constructor.
dags/hivemind_etl_helpers/src/db/github/transform/commits.py | Updated excluded_llm_metadata_keys to replace html_url with url.
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py | Enhanced tests to include new url field in expected metadata for transformed documents.
dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_commits.py | Updated assertions in tests to check for url instead of html_url in commit objects.
dags/hivemind_etl_helpers/tests/integration/test_github_transformation.py | Renamed html_url to url in transformation tests for GitHubCommit.
dags/hivemind_etl_helpers/tests/unit/test_github_transform_commits.py | Changed instances of html_url to url in test cases for GitHubCommit.
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py | Hardcoded messageId values in tests and updated expected metadata to include new url field.
dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_commits.py | Renamed html_url to url in assertions for fetched commit data in tests.

🐇 In the land where messages play,
A new URL guides the way.
From Discord's threads to GitHub's flow,
Metadata shines, watch it grow!
With each commit and message bright,
Our data dances, pure delight! 🌟



@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)
dags/hivemind_etl_helpers/src/db/github/schema/commit.py (1)

Line range hint 11-61: Consider implementing a migration strategy

The rename from html_url to url is a breaking change that affects multiple layers:

  1. Data storage/retrieval
  2. Object instantiation
  3. Serialization/deserialization
  4. External API contracts

Consider implementing one of these migration strategies (option 1 is sketched below):

  1. Temporary support for both field names during transition
  2. Version the API/schema
  3. Database migration scripts
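
As a minimal sketch of option 1 above, a from_dict that accepts both keys during the transition could look like this. The class shape below is hypothetical and trimmed; only the html_url to url rename comes from this PR:

#!/usr/bin/env python
# Hypothetical, trimmed commit schema illustrating dual-field support.
from dataclasses import dataclass

@dataclass
class GitHubCommit:
    message: str
    url: str

    @classmethod
    def from_dict(cls, data: dict) -> "GitHubCommit":
        # Prefer the new "url" key; fall back to the legacy "html_url"
        # so older producers keep working during the migration window.
        url = data.get("url", data.get("html_url"))
        if url is None:
            raise KeyError("expected 'url' or 'html_url' in commit data")
        return cls(message=data["message"], url=url)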
dags/hivemind_etl_helpers/tests/integration/test_github_transformation.py (1)

Let me analyze the results to verify the URL field naming consistency. The search results show:

  1. In schema files:
  • schema/comment.py, schema/issue.py, schema/commit.py, and schema/pull_request.py consistently use url field
  • schema/commit.py uses both url and api_url fields
  • schema/pull_request.py uses both url and issue_url fields
  2. In extract SQL files:
  • extract/comment.py, extract/commit.py, extract/issues.py, and extract/pull_requests.py consistently map html_url to url in their SQL queries
  3. In test files:
  • Most test files still use html_url in their test data setup
  • The transformation test (test_github_transformation.py) correctly uses the new url field

URL field naming has been consistently updated in schema and extraction layers, but test data setup still uses old naming.

  • Test files need to be updated to use url instead of html_url for consistency:
    • test_github_etl_fetch_comments.py
    • test_github_etl_fetch_raw_commits.py
    • test_github_etl_fetch_raw_pull_requests.py
    • test_github_etl_fetch_raw_issues.py
    • test_github_etl_fetch_pull_requests.py
    • test_github_etl_fetch_comments_filtered.py
    • test_github_etl_fetch_issues.py
🔗 Analysis chain

Line range hint 107-113: Verify URL field naming consistency across the codebase.

The metadata assertion correctly uses the new url field. Let's verify that this naming convention is consistently applied across all GitHub-related files.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining instances of html_url in GitHub-related files
# and ensure consistent usage of the new url field.

# Search for any remaining html_url occurrences in GitHub-related files
echo "Checking for remaining html_url occurrences:"
rg "html_url" --type py

# Search for the new url field to verify consistent usage
echo -e "\nVerifying consistent url field usage:"
rg "url.*=|url['\"]:" --type py

Length of output: 95693

dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_commits.py (1)

Line range hint 25-35: Update Neo4j test data setup to use consistent URL field names

The test assertions have been updated to use url, but the Neo4j queries still set parents.0.html_url. This inconsistency should be addressed.

Update the Neo4j setup queries in all test methods. For example:

 CREATE (co:GitHubCommit)
     SET
         co.`commit.author.name` = "Author#1",
         co.`commit.message` = "Issue #1 is resolved!",
         co.`commit.url` = "https://api.sample_url_for_commit.html",
-        co.`parents.0.html_url` = "https://sample_url_for_commit.html",
+        co.`parents.0.url` = "https://sample_url_for_commit.html",
         co.repository_id = 123,

This change should be applied to all test methods to maintain consistency with the new field naming.

Also applies to: 78-88, 127-137, 171-181, 252-262

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c6db4c1 and 72348fd.

📒 Files selected for processing (8)
  • dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (4 hunks)
  • dags/hivemind_etl_helpers/src/db/github/extract/commit.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/github/schema/commit.py (4 hunks)
  • dags/hivemind_etl_helpers/src/db/github/transform/commits.py (1 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py (4 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_commits.py (5 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_github_transformation.py (2 hunks)
  • dags/hivemind_etl_helpers/tests/unit/test_github_transform_commits.py (8 hunks)
🔇 Additional comments (20)
dags/hivemind_etl_helpers/src/db/github/transform/commits.py (1)

30-30: LGTM! Verify consistent URL field naming across the codebase.

The change from html_url to url in the excluded metadata keys aligns with the broader refactoring effort. The implementation looks correct.

Let's verify the consistency of this rename across the codebase:

✅ Verification successful

The URL field rename is consistently implemented across the codebase

The verification shows that:

  1. The change from html_url to url is consistently implemented in the extraction layer where html_url is aliased as url in all SQL queries
  2. All transform modules correctly use url in their metadata fields
  3. The excluded metadata keys in transform modules consistently use url (not html_url)
  4. The only remaining html_url references are in:
    • Test files (which is expected for test data)
    • Extract layer SQL queries where it's properly aliased as url
    • One utility function in github_api_helpers that parses the URL to determine comment type

The change is verified to be consistent across the codebase and properly implemented.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining instances of html_url to ensure complete migration
# and verify the consistent use of the new 'url' field.

echo "Checking for any remaining html_url references:"
rg "html_url" -l

echo "\nVerifying usage of new 'url' field in GitHub-related files:"
rg "url" --type python -g "*github*.py" -B 2 -A 2

Length of output: 1426


Script:

#!/bin/bash
# Let's examine the content of files that still contain html_url references
# to determine if they are test data or actual implementation

# Check content of extract files containing html_url
echo "=== Checking extract implementation files ==="
for file in dags/hivemind_etl_helpers/src/db/github/extract/*.py; do
    echo "=== $file ==="
    rg "html_url" -C 3 "$file"
done

# Check transform files for url usage
echo -e "\n=== Checking transform files for url usage ==="
fd "transform.*\.py$" --full-path --exec rg "url" -C 3 {}

# Check if html_url appears in any non-test Python files outside extract
echo -e "\n=== Checking other implementation files for html_url ==="
rg "html_url" --type py -g "!*test*" -g "!*/extract/*" -C 3

Length of output: 30686

dags/hivemind_etl_helpers/src/db/github/schema/commit.py (4)

11-11: Verify all constructor calls are updated

The parameter rename from html_url to url is a breaking change that requires updates to all constructor calls.

✅ Verification successful

Parameter rename from html_url to url is properly updated

All direct constructor calls to GitHubCommit in the test files are already using the new url parameter name. The html_url occurrences found in the codebase are related to other entities (comments, pull requests, issues) and not to the GitHubCommit class.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for direct constructor calls to GitHubCommit
rg "GitHubCommit\(" -A 10 -B 2

# Search for dictionary construction that might be used with from_dict
rg "html_url.*?['\"].*?['\"]" --type py

Length of output: 15978


61-61: Verify data consumer compatibility

The change from html_url to url in the serialized output requires updates to all consumers of this data.


29-29: Verify all attribute accesses are updated

The instance variable rename from html_url to url requires updates to all code accessing this attribute.

✅ Verification successful

Let me analyze the search results to verify the attribute rename impact.


No impact from attribute rename - changes are consistent

The attribute rename from html_url to url in GitHubCommit class is safe because:

  1. The class definition shows this is a schema class where the attribute is properly renamed everywhere (constructor, from_dict, to_dict methods)
  2. All occurrences of html_url in the codebase are in different contexts:
    • SQL queries in extract modules where html_url is aliased to url
    • Test data setup using different classes (issues, pull requests, comments)
    • API response handling in a different module (github_api_helpers)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for direct attribute access
rg "\.html_url" --type py

# Search for string literals that might be used in attribute access
rg "'html_url'|\"html_url\"" --type py

Length of output: 8546


Script:

#!/bin/bash
# Let's check the schema file to understand the context better
cat dags/hivemind_etl_helpers/src/db/github/schema/commit.py

# Also check if there are any direct attribute accesses in Python code
ast-grep --pattern 'html_url'

Length of output: 2938


45-45: Verify data source compatibility

The change from html_url to url in data extraction requires that all data sources (APIs, databases, etc.) provide the correct field name.

✅ Verification successful

Based on the search results, I can now provide a final response. The test files and implementation show that the GitHub API response structure consistently uses the url field, and the codebase is already aligned with this change. The test data and actual implementation both use the url field for commit URLs.

Change from html_url to url is correct

The codebase consistently uses the url field in both test data and implementation:

  • The GitHubCommit class expects and processes the url field
  • Test fixtures and integration tests use the url field for commit data
  • The Neo4j query extracts the URL using parents.0.html_url as url in the response
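
For illustration, a minimal sketch of that aliasing pattern, assuming the official neo4j Python driver (5.x); the node label and backtick-quoted property names are taken from the test setup quoted earlier, and the helper itself is illustrative rather than the repo's code:

from neo4j import GraphDatabase  # assumes neo4j Python driver >= 5.x

# The stored property keeps its historical name; the query exposes it as url.
QUERY = """
MATCH (co:GitHubCommit)
RETURN
    co.`commit.message` AS message,
    co.`parents.0.html_url` AS url
"""

def fetch_commit_urls(uri: str, user: str, password: str) -> list[dict]:
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        records, _, _ = driver.execute_query(QUERY)
        return [record.data() for record in records]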
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for Neo4j queries or API calls that might return this field
rg -B 5 -A 5 "RETURN.*html_url|html_url.*FROM" --type py
rg -B 5 -A 5 "response\[.*html_url" --type py

Length of output: 106


Script:

#!/bin/bash
# Let's check the GitHub API response structure and usage patterns
rg -B 5 -A 5 '"url".*github' --type py
rg -B 5 -A 5 'github.*"url"' --type py

# Let's also check the commit schema and related files
fd commit.py --exec cat {}

Length of output: 70190

dags/hivemind_etl_helpers/src/db/github/extract/commit.py (1)

81-81: Verify Neo4j schema compatibility for the URL field rename.

The change from co.parents.0.html_url to co.parents.0.url aligns with the standardization of URL field naming across the codebase. However, we should verify that this field exists in the Neo4j schema to prevent runtime issues.

Let's verify the field existence in Neo4j schema:

dags/hivemind_etl_helpers/tests/unit/test_github_transform_commits.py (6)

45-46: LGTM! Metadata structure is consistent.

The metadata assertions correctly validate both URL fields with proper GitHub URL patterns.


64-65: LGTM! URL fields properly structured with PR relation.

The URL fields are correctly implemented alongside the PR relationship data.


78-79: LGTM! Consistent URL structure maintained.

The URL fields maintain consistency with the established pattern.


91-92: LGTM! Good test coverage with repository variants.

The URL fields remain consistent while testing different repository scenarios, providing good coverage of various cases.


117-118: LGTM! Comprehensive test coverage for URL field assertions.

The metadata assertions thoroughly validate URL fields across different scenarios:

  • Commits with and without related PRs
  • Different repository IDs and names
  • Various verification states

Also applies to: 135-136, 153-154


23-24: LGTM! Verify URL field usage across the codebase.

The URL field renaming is correctly implemented in the test data. The URL format follows GitHub's standard pattern.

Let's verify the consistency of this change across the codebase:

dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py (4)

171-172: LGTM: URL construction for regular messages

The URL construction follows Discord's URL format correctly, using the appropriate channel ID for regular messages.


184-185: LGTM: URL construction for messages with mentions

The URL construction is correct for messages containing user mentions, maintaining consistency with Discord's URL format.


209-210: LGTM: Clear distinction between Discord URL and referenced URLs

The metadata correctly maintains both the Discord message URL (url) and the referenced external URL (url_reference), serving different purposes.


197-199: Verify the URL construction for threaded messages

The URL is constructed using threadId (88888) instead of channelId (1313132) for threaded messages. While this might be intentional, we should verify this behavior.

Let's verify Discord's URL format for threaded messages:

✅ Verification successful

URL construction for threaded messages is correct

The code in transform_discord_raw_messges.py explicitly shows that Discord URLs are constructed differently based on whether the message is in a thread or not:

  • For regular messages: discord.com/channels/{guild_id}/{channel_id}/{message_id}
  • For threaded messages: discord.com/channels/{guild_id}/{thread_id}/{message_id}

The test case correctly uses the thread ID (88888) instead of the channel ID for a threaded message, matching the implementation logic.
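
A minimal sketch of that branching logic (the helper name is illustrative and the guild ID in the example is invented; the thread, channel, and message IDs match the test data):

def build_discord_message_url(
    guild_id: str, channel_id: str, message_id: str, thread_id: str | None = None
) -> str:
    # Threaded messages link through the thread ID; regular ones through the channel ID.
    location_id = thread_id if thread_id is not None else channel_id
    return f"https://discord.com/channels/{guild_id}/{location_id}/{message_id}"

# Example (guild ID invented):
# build_discord_message_url("999", "1313132", "10000000000", thread_id="88888")
# -> "https://discord.com/channels/999/88888/10000000000"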

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other instances of URL construction for threaded messages
rg -l 'channels/.*/.*/.*' | xargs rg 'threadId.*messageId'

# Search for tests specifically handling thread URLs
fd -e py | xargs rg 'thread.*url.*discord\.com/channels'

Length of output: 119


Script:

#!/bin/bash
# Let's try a broader search for Discord URL construction patterns
rg -A 3 "discord\.com/channels" 

# Also search for any URL-related tests or constants
rg -A 3 "url.*discord\.com"

# Look for thread-related URL handling
ast-grep --pattern 'url = $_ + "/channels/"'

Length of output: 5012

dags/hivemind_etl_helpers/tests/integration/test_github_transformation.py (1)

86-89: LGTM! Clear distinction between URL types.

The change from html_url to url maintains a clear separation between the web URL and API URL fields.

dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py (3)

98-101: LGTM: Clean extraction of message identifiers

The extraction of message identifiers is well-organized and follows consistent naming conventions.


177-177: LGTM: Consistent metadata field addition

The URL field is appropriately added to the metadata dictionary, maintaining consistency with the codebase-wide change from 'html_url' to 'url'.


247-247: Verify the decision to exclude URLs from metadata

The URL is being excluded from both embed and LLM metadata. Consider if this is the intended behavior:

  • URLs might be useful for traceability in embeddings
  • They could provide context for LLM processing

Let's check if this aligns with how URLs are handled in other document types:

Also applies to: 264-264

✅ Verification successful

URL exclusion from metadata is consistent with codebase patterns

The exclusion of URLs from both embed and LLM metadata in Discord messages aligns with the established patterns across other document types in the codebase:

  • MediaWiki documents exclude URLs from both embed and LLM metadata
  • GitHub resources (comments, issues, commits) consistently exclude URLs from LLM metadata
  • This consistent treatment suggests a deliberate architectural decision to exclude URLs from metadata processing
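
For illustration, a small sketch of what that exclusion looks like in practice, assuming the llama_index Document API; the message text and author value are invented:

from llama_index.core import Document
from llama_index.core.schema import MetadataMode

doc = Document(
    text="sample message",  # invented content
    metadata={"author": "user#1", "url": "https://discord.com/channels/1/2/3"},
    excluded_embed_metadata_keys=["url"],  # keep the URL out of embeddings
    excluded_llm_metadata_keys=["url"],    # and out of LLM prompt context
)

# The URL stays on the document for traceability, but is dropped from the
# content derived for embedding and LLM consumption:
print(doc.get_content(metadata_mode=MetadataMode.EMBED))  # no url line
print(doc.get_content(metadata_mode=MetadataMode.LLM))    # no url line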
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check how URL fields are handled in other document types
rg -A 5 "excluded_(embed|llm)_metadata_keys.*url" --glob '!transform_discord_raw_messges.py'

Length of output: 2457

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (2)

108-108: Add a comment explaining the messageId values.

The hardcoded messageIds make the tests deterministic, which is good. However, consider adding a comment explaining why these specific values were chosen (e.g., "Using sequential messageIds starting from 10000000000 for consistent and traceable test data").

Also applies to: 126-126, 144-144, 162-162, 182-182


274-274: Add test cases for URL edge cases.

While the URL construction looks correct, consider adding test cases for:

  • Messages with invalid/malformed guild_id, channel_id, or message_id
  • Messages in archived threads
  • Messages in forum posts

Also applies to: 288-288, 303-303, 313-313

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 72348fd and 6329b86.

📒 Files selected for processing (2)
  • dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (9 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_commits.py (2 hunks)
🔇 Additional comments (2)
dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_commits.py (1)

65-65: Verify URL field renaming impact

The change from html_url to url might affect other parts of the codebase. Let's verify the consistency of this change.

Also applies to: 109-109

✅ Verification successful

The URL field change is consistent with the codebase design

The change from html_url to url in the test assertion is correct and consistent with the codebase. The Neo4j queries in hivemind_etl_helpers/src/db/github/extract/*.py show that html_url fields are consistently aliased as url in the extraction layer:

  • pr.html_url as url
  • c.html_url AS url
  • co.parents.0.html_url AS url
  • i.html_url as url

This indicates that the test is correctly asserting against the transformed field name as per the ETL design.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining instances of html_url and potential inconsistencies

# Search for remaining html_url occurrences in Python files
echo "Searching for remaining html_url occurrences:"
rg --type python "html_url" -l

# Search for Neo4j queries that might need updating
echo -e "\nSearching for Neo4j queries with html_url:"
rg --type python "parents\.0\.html_url"

# Search for related test assertions
echo -e "\nSearching for test assertions with url/html_url:"
rg --type python "assertEqual.*html_url|assertEqual.*url"

Length of output: 540


Script:

#!/bin/bash
# Let's try without the --type python flag since it's not supported
# Search for remaining html_url occurrences
echo "Searching for remaining html_url occurrences:"
rg "html_url"

# Search for Neo4j queries that might need updating
echo -e "\nSearching for Neo4j queries with html_url:"
rg "parents\.0\.html_url"

# Search for test assertions with url/html_url
echo -e "\nSearching for test assertions with url/html_url:"
rg "assertEqual.*?[\"']url[\"']|assertEqual.*?[\"']html_url[\"']"

# Search for url field usage in the specific test file
echo -e "\nSearching in the test file:"
cat "dags/hivemind_etl_helpers/tests/integration/test_github_etl_fetch_raw_commits.py"

Length of output: 16944

dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py (1)

274-274: Verify URL construction implementation.

Let's verify that the URL construction logic in the implementation handles all cases correctly.

Also applies to: 288-288, 303-303, 313-313

✅ Verification successful

URL construction implementation is correctly handled

The verification shows that the URL construction is implemented correctly in transform_discord_raw_messges.py with proper handling of both regular channel messages and thread messages:

  • For regular messages: https://discord.com/channels/{guild_id}/{channel_id}/{message_id}
  • For thread messages: https://discord.com/channels/{guild_id}/{thread_id}/{message_id}

The test files confirm this implementation with comprehensive test cases covering both scenarios, including proper handling of guild IDs, channel IDs, thread IDs, and message IDs.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the URL construction implementation in the codebase

# Find the file containing the URL construction logic
echo "Searching for URL construction implementation..."
rg -l "discord.com/channels" --type py

# Look for any existing URL validation or construction methods
echo "Checking for URL-related methods..."
ast-grep --pattern 'def $_(url$$$) { $$$ }' --lang py

Length of output: 966


Script:

#!/bin/bash
# Let's check the URL construction implementation in the found files
echo "Checking URL construction in transform_discord_raw_messges.py..."
rg -A 5 -B 5 "discord.com/channels" dags/hivemind_etl_helpers/src/db/discord/utils/transform_discord_raw_messges.py

echo "\nChecking URL construction in test files..."
rg -A 5 -B 5 "discord.com/channels" dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_document_from_db.py dags/hivemind_etl_helpers/tests/integration/test_discord_prepare_llama.py

# Let's also check for any URL-related functions with a corrected pattern
echo "\nChecking for URL-related functions..."
ast-grep -p 'def $_($$$) { $$$ discord.com/channels $$$ }' --lang py

Length of output: 12305

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (6)
dags/hivemind_etl_helpers/src/db/discourse/utils/transform_raw_to_documents.py (1)

47-47: Consider metadata structure optimization

The metadata dictionary includes both forum_endpoint and the full link. Since the forum_endpoint can be extracted from the link if needed, consider removing it to avoid redundancy.

                metadata={
                    "author_name": post["author_name"],
                    "author_username": post["author_username"],
-                    "forum_endpoint": forum_endpoint,
                    "createdAt": post["createdAt"],
                    "updatedAt": post["updatedAt"],
                    "postId": post["postId"],
                    "topic": post["topic"],
                    "category": post["category"],
                    "authorTrustLevel": post["authorTrustLevel"],
                    "liker_usernames": post["liker_usernames"],
                    "liker_names": post["liker_names"],
                    "replier_usernames": post["replier_usernames"],
                    "replier_names": post["replier_names"],
                    "link": link,
                },

Also applies to: 58-58

dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py (2)

Line range hint 8-23: Update function docstring to document new return fields.

The function's docstring should be updated to document the new topic_id and post_number fields in the return value.

     Returns
     ---------
     raw_records : list[neo4j._data.Record]
-        list of neo4j records as the result
+        list of neo4j records containing:
+        - topic_id: The ID of the associated topic
+        - post_number: The sequential number of the post within its topic
+        - ... (other existing fields)

Line range hint 140-160: Add missing post_number field to grouped results.

The post_number field is added to the main query but missing from the grouped results structure. This inconsistency could affect URL construction for grouped posts.

Add the missing field to maintain consistency with fetch_raw_posts:

         RETURN
             date,
             COLLECT({
                 author_username: author.username,
                 author_name: author.name,
                 topic: topic,
                 postId: postId,
                 forum_endpoint: $forum_endpoint,
+                post_number: p.postNumber,
                 raw: raw,
                 createdAt: createdAt,
                 updatedAt: updatedAt,
dags/hivemind_etl_helpers/tests/integration/test_discourse_fetch_posts.py (3)

58-59: Ensure consistent floating-point format for postNumber

The postNumber property is set as a floating-point value (1.0). While this works, consider documenting why floating-point is used instead of integer, as post numbers are typically sequential integers.


175-176: Consider adding boundary test cases for post numbers

While the current tests cover basic scenarios, consider adding test cases for:

  1. Posts with very large numbers
  2. Posts with missing post numbers
  3. Posts with duplicate post numbers

This would help ensure robust handling of edge cases.

Example test case to add:

def test_fetch_posts_with_edge_case_numbers(self):
    neo4j_ops = Neo4jConnection().neo4j_ops
    # Setup test data with edge cases
    neo4j_ops.neo4j_driver.execute_query(
        """
        CREATE (p:DiscoursePost)
        SET
            p.endpoint = "wwwdwadeswdpoi123",
            p.postNumber = 999999.0  // Very large number
        """
    )
    # Add assertions for edge cases

Also applies to: 204-205, 241-241


Line range hint 1-254: Consider implementing test data fixtures

The test file contains a lot of duplicate test data setup code. Consider implementing test fixtures or helper methods to:

  1. Centralize test data creation
  2. Reduce code duplication
  3. Make test maintenance easier

Example implementation:

@classmethod
def setUpClass(cls):
    cls.neo4j_ops = Neo4jConnection().neo4j_ops
    
def create_test_post(self, post_number, created_at, **kwargs):
    """Helper method to create test posts with default values"""
    return self.neo4j_ops.neo4j_driver.execute_query(
        """
        CREATE (p:DiscoursePost)
        SET
            p.endpoint = $endpoint,
            p.postNumber = $post_number,
            p.createdAt = $created_at
            // ... other default properties
        RETURN p
        """,
        endpoint="wwwdwadeswdpoi123",
        post_number=post_number,
        created_at=created_at,
        **kwargs
    )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7c1f32a and c22cdef.

📒 Files selected for processing (4)
  • dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py (1 hunks)
  • dags/hivemind_etl_helpers/src/db/discourse/utils/transform_raw_to_documents.py (2 hunks)
  • dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py (4 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_discourse_fetch_posts.py (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • dags/hivemind_etl_helpers/src/db/telegram/transform/messages.py
🔇 Additional comments (5)
dags/hivemind_etl_helpers/src/db/discourse/utils/transform_raw_to_documents.py (1)

Line range hint 36-58: Verify URL construction across the codebase

Let's verify that the URL construction pattern is consistent with other parts of the codebase.

✅ Verification successful

URL construction pattern is consistent across the codebase

The verification shows that the URL construction pattern https://{forum_endpoint}/t/{topic_id}/{post_number} is consistently used throughout the codebase. This is evidenced by:

  • The implementation in transform_raw_to_documents.py
  • Multiple matching test cases in test_discourse_transform_raw_data.py that follow the same pattern
  • Consistent usage of forum_endpoint from the post dictionary
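
A minimal sketch of that construction (the helper name, the int() normalization, and the topic_id in the example are illustrative; the endpoint matches the test data, and the tests above show Neo4j storing postNumber as a float such as 1.0):

def build_discourse_post_url(
    forum_endpoint: str, topic_id: int, post_number: float
) -> str:
    # postNumber comes back from Neo4j as a float (e.g. 1.0), so normalize it.
    return f"https://{forum_endpoint}/t/{topic_id}/{int(post_number)}"

# Example (topic_id invented):
# build_discourse_post_url("wwwdwadeswdpoi123", 1, 1.0)
# -> "https://wwwdwadeswdpoi123/t/1/1"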
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for similar URL construction patterns in the codebase
rg -A 2 "https://.*?/t/.*?/" --type py

# Search for forum_endpoint usage
ast-grep --pattern 'post["forum_endpoint"]'

Length of output: 4543

dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py (1)

58-58: LGTM! Verify field availability.

The addition of topic_id and post_number fields aligns with the PR objective to include URLs for documents. The field references are correct.

Let's verify that these fields are consistently available in the database:

Also applies to: 61-61

✅ Verification successful

Fields are used correctly for URL construction

Based on the evidence found, the topic_id and post_number fields are correctly used in URL construction (transform_raw_to_documents.py). The URL pattern follows the format: https://{forum_endpoint}/t/{topic_id}/{post_number}, which is a standard Discourse URL structure. Since these fields are essential for URL construction and are part of the core Discourse data model, they will be consistently available in the database.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that topic_id and postNumber are consistently present in the database
# to avoid potential null values in URL construction

ast-grep --pattern 'CREATE (p:DiscoursePost {$$$}) 
  SET p.postNumber = $_,
      p.topicId = $_'

Length of output: 100


Script:

#!/bin/bash
# Let's try a different approach to verify the field usage:
# 1. Check if these fields are consistently used in queries
# 2. Look for any null checks or error handling around these fields

# Check query patterns and field usage
rg -U "MATCH.*\(t:DiscourseTopic\).*\(p:DiscoursePost\).*RETURN.*t\.id.*p\.postNumber" --type py

# Check for any null handling or validation
rg "(?i)(topic_id|post_number).*(None|null|missing|empty)" --type py -A 2

# Look for any error handling related to these fields
rg "(?i)(topic_id|post_number).*(try|except|error|raise)" --type py -A 2

Length of output: 9955


Script:

#!/bin/bash
# Let's check the schema and data handling in the codebase:
# 1. Look for field definitions or schema declarations
# 2. Check how these fields are used in URL construction

# Check for schema or field definitions
rg -U "class.*DiscoursePost|class.*DiscourseTopic" --type py -A 10

# Look for URL construction using these fields
rg "(?i)(url|link).*\b(topic_id|post_number)\b" --type py -A 2

# Check the actual query usage in the codebase
rg -U "MATCH.*\(p:DiscoursePost\)" --type py

Length of output: 2835

dags/hivemind_etl_helpers/tests/integration/test_discourse_fetch_posts.py (3)

87-88: LGTM: Consistent with previous post number pattern

The second post correctly follows the same floating-point format (2.0) as the first post.


Line range hint 140-152: LGTM: Comprehensive assertions for second post

The assertions for the second post properly validate all fields including the new post_number field, maintaining consistency with the first post's assertions.


Line range hint 124-136: Verify post_number field handling in fetch_raw_posts function

The test assertions have been updated to include the new post_number field. Let's verify the implementation in the source function.

✅ Verification successful

Let's search for the implementation to verify the post_number field handling. The previous ast-grep pattern might not have matched due to different syntax or file type, so let's try a broader search approach.


Let's check the actual query implementation to verify how the post_number field is handled in the fetch_raw_posts function.


Let's try different search patterns to locate the Cypher query implementation in the fetch_raw_posts function.


Post number field is correctly implemented in fetch_raw_posts function

The Cypher query in the fetch_raw_posts function correctly handles the post_number field by retrieving it from the postNumber property of the DiscoursePost node:

p.postNumber as post_number

This matches the test assertion that expects the post_number field in the returned data.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for post_number field handling in fetch_raw_posts function
ast-grep --pattern 'RETURN {
  $$$
  post_number: p.postNumber,
  $$$
}'

Length of output: 74


Script:

#!/bin/bash
# Search for files containing fetch_raw_posts function
fd -e py -e cypher | xargs rg -l "fetch_raw_posts"

# After finding the files, let's examine the content
fd -e py -e cypher | xargs rg "fetch_raw_posts" -A 10 -B 10

Length of output: 28677


Script:

#!/bin/bash
# Let's examine the Cypher query in fetch_raw_posts function to verify post_number field
rg -A 50 "MATCH \(p:DiscoursePost\)" dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py

Length of output: 100


Script:

#!/bin/bash
# Let's try a simpler search pattern to find the Cypher query
rg -A 50 "MATCH.*DiscoursePost" dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py

# Also search for the query string definition
rg -B 5 -A 50 "query = " dags/hivemind_etl_helpers/src/db/discourse/fetch_raw_posts.py

Length of output: 8459

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (2)

259-291: LGTM! Well-structured test for single document transformation.

The test thoroughly verifies the document transformation logic and metadata handling.

Remove the unnecessary f-string prefix:

-                "url": f"https://drive.google.com/file/d/file_1/view",
+                "url": "https://drive.google.com/file/d/file_1/view",
🧰 Tools
🪛 Ruff (0.8.0)

288-288: f-string without any placeholders

Remove extraneous f prefix

(F541)


292-373: LGTM! Comprehensive test coverage for multiple documents.

The test thoroughly verifies the transformation logic across multiple documents. However, there are a few improvements we can make:

  1. Remove unnecessary f-string prefixes:
-                "url": f"https://drive.google.com/file/d/file_1/view",
+                "url": "https://drive.google.com/file/d/file_1/view",
-                "url": f"https://drive.google.com/file/d/file_2/view",
+                "url": "https://drive.google.com/file/d/file_2/view",
-                "url": f"https://drive.google.com/file/d/file_3/view",
+                "url": "https://drive.google.com/file/d/file_3/view",
  2. Consider refactoring the test data setup to reduce duplication:
def create_test_document(doc_id, file_id, author):
    return Document(
        doc_id=doc_id,
        text="test",
        metadata={
            "file id": f"file_{file_id}",
            "author": f"author_{author}",
            "file name": f"file_name_{file_id}",
            "mime type": "mime",
            "created at": "date",
            "modified at": "modified",
        },
    )

# In test method:
documents = [
    create_test_document(i, i, i) for i in range(1, 4)
]
🧰 Tools
🪛 Ruff (0.8.0)

345-345: f-string without any placeholders

Remove extraneous f prefix

(F541)


358-358: f-string without any placeholders

Remove extraneous f prefix

(F541)


371-371: f-string without any placeholders

Remove extraneous f prefix

(F541)

dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (1)

122-123: Type annotation compatibility with earlier Python versions

The type annotation str | None requires Python 3.10 or later. If the project needs to support earlier Python versions, consider using Optional[str] from the typing module for compatibility.

You can update the type annotation as follows:

-from typing import List
+from typing import List, Optional

     for doc in transformed_docs:
-        file_id: str | None = doc.metadata.get("file id")
+        file_id: Optional[str] = doc.metadata.get("file id")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ceecdc0 and 828158c.

📒 Files selected for processing (2)
  • dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (2 hunks)
  • dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py

288-288: f-string without any placeholders

Remove extraneous f prefix

(F541)


345-345: f-string without any placeholders

Remove extraneous f prefix

(F541)


358-358: f-string without any placeholders

Remove extraneous f prefix

(F541)


371-371: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (1)
dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (1)

53-54: Integration of document transformation step

The addition of the _transform_google_documents method to transform documents before returning them is a good enhancement. This ensures all documents include the necessary URL metadata.

dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (review comment marked outdated and resolved)
@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (2)

119-128: LGTM with suggestions for improvement

The implementation correctly addresses the deep copy concern from the previous review and properly handles the URL construction.

Consider extracting the URL format as a class constant for better maintainability:

 class GoogleDriveLoader:
+    GDRIVE_FILE_URL_FORMAT = "https://drive.google.com/file/d/{}/view"
     
     def __init__(self, refresh_token: str):
         self.refresh_token = refresh_token
         
     def _transform_google_documents(self, documents: list[Document]) -> list[Document]:
         transformed_docs: list[Document] = copy.deepcopy(documents)
         for doc in transformed_docs:
             file_id: str | None = doc.metadata.get("file id")
             if file_id is None:
                 doc.metadata["url"] = None
             else:
-                doc.metadata["url"] = f"https://drive.google.com/file/d/{file_id}/view"
+                doc.metadata["url"] = self.GDRIVE_FILE_URL_FORMAT.format(file_id)

115-118: Enhance method documentation

The docstring could be more descriptive about the method's behavior and return value.

Consider updating the docstring:

     def _transform_google_documents(self, documents: list[Document]) -> list[Document]:
         """
-        transform google extracted documents by inserting their metadata a url
+        Transform Google Drive documents by adding URL to their metadata.
+        
+        Args:
+            documents: List of Document objects with Google Drive metadata
+        
+        Returns:
+            List of transformed Document objects with 'url' added to metadata
         """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 828158c and 3e8f740.

📒 Files selected for processing (1)
  • dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (3 hunks)
🔇 Additional comments (2)
dags/hivemind_etl_helpers/src/db/gdrive/gdrive_loader.py (2)

1-1: LGTM: Import added correctly

The copy module import is appropriately placed and necessary for the deep copying functionality.


53-55: LGTM: Document transformation properly integrated

The transformation step is correctly placed after all documents are loaded but before returning. This ensures consistent processing of all documents regardless of their source (folders, drives, or files).

Let's verify that all callers of load_data can handle the modified document structure:

✅ Verification successful

LGTM: Document transformation is safely integrated

The transformation is correctly implemented and the changes are safe because:

  1. The URL field is only added to the document metadata, not modifying any existing fields
  2. The transformation is properly tested as shown in the test files
  3. All callers of load_data receive the documents through the Document class interface and don't make assumptions about specific metadata fields
  4. The URL field is safely handled with a fallback to None when file_id is missing
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for all callers of load_data to ensure they handle the new URL field
rg -l "load_data\(" | xargs rg "metadata\[.?['\"]url['\"]"

Length of output: 309


Script:

#!/bin/bash
# Let's check the actual transformation logic and its usage
ast-grep --pattern 'def _transform_google_documents($$$)'

Length of output: 59


Script:

#!/bin/bash
# Let's check the implementation with ripgrep since ast-grep didn't work
rg -A 10 "_transform_google_documents" 

Length of output: 4484


Script:

#!/bin/bash
# Let's check where the transformed documents are used
rg -B 5 -A 5 "load_data\(" dags/hivemind_etl_helpers/src/db/gdrive/

Length of output: 6005

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (2)

121-147: LGTM! Consider extracting test data

The expected results correctly include the new URL field with proper Google Drive URL format. Consider extracting the test data into a separate fixture or helper method to improve test maintainability.

def create_test_document(file_id: str, author: str) -> Document:
    return Document(
        id_=f"{file_id}.docx",
        metadata={
            "file_name": f"{file_id}.docx",
            "file id": file_id,
            "author": author,
            "file name": "Option",
            "url": f"https://drive.google.com/file/d/{file_id}/view",
        },
        text="Option 1: Keep it super casual"
    )

284-316: Remove unnecessary f-string prefix

The test is well-structured and validates the document transformation correctly. However, there's an unnecessary f-string prefix that should be removed.

-                "url": f"https://drive.google.com/file/d/file_1/view",
+                "url": "https://drive.google.com/file/d/file_1/view",
🧰 Tools
🪛 Ruff (0.8.0)

313-313: f-string without any placeholders

Remove extraneous f prefix

(F541)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3e8f740 and a4325e1.

📒 Files selected for processing (1)
  • dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py

313-313: f-string without any placeholders

Remove extraneous f prefix

(F541)


370-370: f-string without any placeholders

Remove extraneous f prefix

(F541)


383-383: f-string without any placeholders

Remove extraneous f prefix

(F541)


396-396: f-string without any placeholders

Remove extraneous f prefix

(F541)

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (1)

121-147: LGTM! Consider extracting URL construction logic

The updated test correctly validates the new URL field in document metadata. However, the URL construction pattern is duplicated across multiple test cases.

Consider extracting the URL construction into a helper method:

def _build_gdrive_url(file_id: str) -> str:
    return f"https://drive.google.com/file/d/{file_id}/view"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a4325e1 and d261e67.

📒 Files selected for processing (1)
  • dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (2 hunks)
🔇 Additional comments (1)
dags/hivemind_etl_helpers/tests/integration/test_gdrive_loader.py (1)

284-398: 🛠️ Refactor suggestion

Refactor tests using parameterized testing

While the tests thoroughly validate the document transformation functionality, there's significant duplication in test data and assertions.

As suggested in the previous review, consider using parameterized testing to improve maintainability. The pattern of creating test documents and validating their metadata is repeated multiple times.

import pytest

@pytest.mark.parametrize("doc_id,file_id,author", [
    (1, "file_1", "author_1"),
    (2, "file_2", "author_2"),
    (3, "file_3", "author_3"),
])
def test_transform_document(self, doc_id, file_id, author):
    loader = GoogleDriveLoader(refresh_token=self.refresh_token)
    
    document = Document(
        doc_id=doc_id,
        text="test",
        metadata={
            "file id": file_id,
            "author": author,
            "file name": f"file_name_{doc_id}",
            "mime type": "mime",
            "created at": "date",
            "modified at": "modified",
        },
    )
    
    transformed_docs = loader._transform_google_documents(documents=[document])
    
    expected_metadata = {
        "file id": file_id,
        "author": author,
        "file name": f"file_name_{doc_id}",
        "mime type": "mime",
        "created at": "date",
        "modified at": "modified",
        "url": f"https://drive.google.com/file/d/{file_id}/view",
    }
    
    assert len(transformed_docs) == 1
    assert transformed_docs[0].metadata == expected_metadata

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
dags/hivemind_etl_helpers/notion_etl.py (1)

96-120: Consider enhancing URL handling robustness

While the implementation is functional, consider these improvements:

  1. Extract the Notion base URL as a class constant or configuration value
  2. Add URL encoding for the page_id to handle special characters
  3. Add example metadata structure in the docstring

Here's a suggested improvement:

 class NotionProcessor:
+    NOTION_BASE_URL = "https://www.notion.so"
+
     def __init__(
         self,
         community_id: str,
         access_token: str | None = None,
     ) -> None:
         # ... existing code ...

     def _transform_documents(self, documents: list[Document]) -> list[Document]:
         """
         transform notion extracted documents by inserting their metadata a url

         Parameters
         ------------
         documents : list[Document]
             a list of notion extracted pages

         Returns
         ---------
         documents : list[Document]
             a list of documents each included with url in its metadata

+        Example
+        --------
+        Input document metadata: {"page_id": "123abc", "title": "My Page"}
+        Output document metadata: {"page_id": "123abc", "title": "My Page", "url": "https://www.notion.so/123abc"}
         """
         transformed_docs: list[Document] = copy.deepcopy(documents)

         for doc in transformed_docs:
             page_id: str | None = doc.metadata.get("page_id")
             if page_id is None:
                 doc.metadata["url"] = None
             else:
-                doc.metadata["url"] = f"https://www.notion.so/{page_id}"
+                # URL encode the page_id to handle special characters
+                encoded_page_id = urllib.parse.quote(page_id)
+                doc.metadata["url"] = f"{self.NOTION_BASE_URL}/{encoded_page_id}"

         return transformed_docs

Don't forget to add the import:

import urllib.parse
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d261e67 and 6a3316c.

📒 Files selected for processing (1)
  • dags/hivemind_etl_helpers/notion_etl.py (4 hunks)
🔇 Additional comments (4)
dags/hivemind_etl_helpers/notion_etl.py (4)

61-62: LGTM: Clean integration of document transformation

The transformation step is properly placed before pipeline execution, maintaining the existing flow while adding URL metadata.


77-78: LGTM: Consistent implementation of document transformation

The transformation step is consistently implemented across methods, ensuring uniform URL metadata addition.


93-94: LGTM: Maintains consistency in document transformation

The transformation step is properly integrated, maintaining consistency with other processing methods.


96-120: Verify URL handling consistency across document types

Let's ensure the URL handling is consistent with other document types in the system.

✅ Verification successful

URL handling is consistent with other document sources

The URL handling in the Notion ETL is consistent with other document sources in the codebase. Specifically:

  • Google Drive documents follow the same pattern of URL handling in metadata, using a similar structure: doc.metadata["url"] = f"https://drive.google.com/file/d/{file_id}/view"
  • Both implementations handle the case where the ID is missing by setting doc.metadata["url"] = None
  • The URL construction follows a consistent pattern across document sources, where each source uses its specific base URL with the respective ID
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check URL handling patterns across different document types

# Check for URL field usage patterns
echo "Checking URL field usage patterns:"
rg -g '!*.pyc' -g '!*test*' 'metadata\[.url.|\.url\s*=' --type py

# Check for Notion URL construction patterns
echo "Checking Notion URL construction patterns:"
rg -g '!*.pyc' 'notion\.so' --type py

# Check for URL encoding usage
echo "Checking URL encoding usage:"
rg -g '!*.pyc' 'urllib\.parse\.quote' --type py

Length of output: 1278


Script:

#!/bin/bash
# Let's check the Document class definition and its usage patterns
echo "Checking Document class definition and metadata handling:"
ast-grep --pattern 'class Document'

# Let's also check for any document transformation patterns
echo "Checking document transformation patterns:"
rg -g '!*.pyc' '_transform.*document' --type py -A 5

# Check for metadata field patterns across the codebase
echo "Checking metadata field patterns:"
rg -g '!*.pyc' 'metadata\[".+"\]' --type py

Length of output: 25165

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
dags/hivemind_etl_helpers/notion_etl.py (2)

98-108: Fix typo in docstring.

There's a typo in the docstring: "inlcuded" should be "included".

-            a list of documents each inlcuded with url in its metadata
+            a list of documents each included with url in its metadata

113-118: Consider enhancing URL construction robustness.

The current URL construction could be improved in several ways:

  1. Consider validating the page_id format before URL construction
  2. The Notion domain is hardcoded - consider making it configurable
  3. URL construction might need URL-safe encoding for the page_id

Here's a suggested improvement:

+    NOTION_BASE_URL = "https://www.notion.so"
+
     def _transform_documents(self, documents: list[Document]) -> list[Document]:
         transformed_docs: list[Document] = copy.deepcopy(documents)
 
         for doc in transformed_docs:
             page_id: str | None = doc.metadata.get("page_id")
-            if page_id is None:
+            if not page_id or not isinstance(page_id, str):
                 doc.metadata["url"] = None
             else:
-                doc.metadata["url"] = f"https://www.notion.so/{page_id}"
+                from urllib.parse import quote
+                safe_page_id = quote(page_id)
+                doc.metadata["url"] = f"{self.NOTION_BASE_URL}/{safe_page_id}"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6a3316c and 7d2191e.

📒 Files selected for processing (1)
  • dags/hivemind_etl_helpers/notion_etl.py (4 hunks)
🔇 Additional comments (2)
dags/hivemind_etl_helpers/notion_etl.py (2)

61-62: LGTM! Consistent transformation step added before ingestion.

The changes consistently add document transformation before pipeline ingestion across all processing methods, maintaining good code organization and clarity.

Also applies to: 77-78, 93-94


96-120: Verify URL format compatibility with dependent systems.

The new URL metadata field might be consumed by other components. Let's verify the compatibility.

✅ Verification successful

URL format is consistent with existing implementations

The codebase search reveals that the URL metadata field is used in a similar pattern in gdrive_loader.py, where it's either set to None or a valid URL. The Notion URL format follows the same pattern:

  • Both implementations handle the case of missing identifiers by setting URL to None
  • Both use standard HTTPS URLs with domain-specific formats
  • No other components are accessing this metadata field
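
For illustration, a standalone sketch of that shared None-fallback pattern operating on plain metadata dicts; the function name is hypothetical, while the base URL and page_id key come from the Notion diff above:

import copy

def add_notion_urls(metadata_list: list[dict]) -> list[dict]:
    transformed = copy.deepcopy(metadata_list)
    for metadata in transformed:
        page_id = metadata.get("page_id")
        # A missing identifier yields an explicit url of None rather than no key at all.
        metadata["url"] = f"https://www.notion.so/{page_id}" if page_id else None
    return transformed

# add_notion_urls([{"page_id": "123abc"}, {}])
# -> [{'page_id': '123abc', 'url': 'https://www.notion.so/123abc'}, {'url': None}]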
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any code that might be accessing the URL field in document metadata
rg -g '!*.pyc' -g '!*test*' 'metadata\[.url.\]|metadata\.get\(.url.\)'

# Search for any tests that might need updating for the new URL field
rg -g '*test*.py' 'metadata\[.url.\]|metadata\.get\(.url.\)'

Length of output: 576

@amindadgar merged commit 8ae632f into main on Nov 28, 2024
14 checks passed
Development

Successfully merging this pull request may close these issues.

feat: Update platform ETLs to include document links