Remove similar URLs, particularly integer checks and images #9

Merged
merged 3 commits into from
Jun 7, 2020

Conversation

ameenmaali
Owner

Adds more deduplication logic for similar URLs, enabled with the -s switch. See issue #6.

@larskraemer
Contributor

Two small suggestions, both for get_path_components:

  • You're erasing the /'s of the path completely, which makes test.com/a/b and test.com/ab the same URL in this mode. Keeping the /'s in would avoid this.
  • For performance, you might want to make url_path and token both string_views and cut up the path with a strategy similar to the one in parse. Shrinking a string_view only adjusts a pointer and a length, whereas erase-ing from a string has to shift the remaining characters.
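
A minimal sketch of both suggestions together, not the code from this PR. The names split_path, is_all_digits, and similarity_key, and the "<int>" placeholder, are hypothetical and chosen only for illustration; the actual get_path_components may look quite different. It assumes C++17 and that integer-only path components should compare as equal in the -s mode.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not the project's get_path_components).
// Splits a URL path into components by shrinking a string_view rather than
// erasing from a std::string. The returned views point into `path`, so the
// underlying buffer must outlive the result.
std::vector<std::string_view> split_path(std::string_view path) {
    std::vector<std::string_view> parts;
    while (!path.empty()) {
        if (path.front() == '/') {             // skip a separator
            path.remove_prefix(1);
            continue;
        }
        const auto end = path.find('/');       // npos when no further '/'
        parts.push_back(path.substr(0, end));  // substr clamps npos to the end
        path.remove_prefix(parts.back().size());
    }
    return parts;
}

// Hypothetical helper: true only for non-empty, all-digit components.
bool is_all_digits(std::string_view s) {
    if (s.empty()) return false;
    for (char c : s) {
        if (c < '0' || c > '9') return false;
    }
    return true;
}

// Builds a comparison key that keeps the '/' separators, so "/a/b" and "/ab"
// stay distinct. Integer components collapse to a placeholder (an assumption
// about what "similar" means here).
std::string similarity_key(std::string_view path) {
    std::string key;
    for (std::string_view part : split_path(path)) {
        key += '/';
        if (is_all_digits(part)) {
            key += "<int>";
        } else {
            key.append(part);
        }
    }
    return key;
}
```

With this sketch, similarity_key("/a/b") and similarity_key("/ab") produce different keys, while "/user/123/profile" and "/user/456/profile" produce the same one.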

@ameenmaali
Owner Author

Nice catch, @larskraemer! I addressed the first one. I'll merge this in and rework it to use string_view in the next couple of days.

@ameenmaali ameenmaali merged commit 515c855 into master Jun 7, 2020
2 participants