Eliminate duplicates that are not in query strings #6
Comments
This functionality would probably need to be added as a switch, since, if we consider somesite.com/product/1 and somesite.com/product/2 to be the same, we would also need to consider I'd propose adding a switch to the program to tell it which part of the path is irrelevant, like so:
Yeah, I understand. I think it could be feasible if the check applies only when the path segment is a digit. 1 is likely similar to 2, so one of them could be discarded. Two different strings probably have different meanings, so both can be kept.
Thanks for the suggestion @simonebovi! I wanted to include this in the initial release but didn't have time to get it done. I have actually started working on it already - I'm focusing on integer differences only, as anything else would be very hard without context and would most likely cause more issues than it solves unless done extremely well.
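A minimal sketch of the integer-only idea discussed above, for illustration. It is not the tool's actual implementation: `is_all_digits`, `dedupe_key`, `dedupe`, and the `{int}` placeholder are hypothetical names chosen for this example. The assumption is that two URLs count as duplicates when they differ only in path segments made up entirely of digits.

```cpp
#include <cassert>  // only needed for assert-based checks of the helpers
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: true if the segment is non-empty and all ASCII digits.
bool is_all_digits(const std::string& segment) {
    if (segment.empty()) return false;
    for (unsigned char c : segment) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}

// Hypothetical helper: build a comparison key in which every all-digit
// path segment is replaced by the placeholder "{int}".
std::string dedupe_key(const std::string& url) {
    // Only the part before '?' or '#' is normalized; the rest is kept as-is.
    const std::size_t path_end = url.find_first_of("?#");
    const std::string path = url.substr(0, path_end);
    const std::string rest =
        (path_end == std::string::npos) ? "" : url.substr(path_end);

    std::string key;
    std::size_t start = 0;
    while (true) {
        const std::size_t slash = path.find('/', start);
        const std::string segment =
            (slash == std::string::npos) ? path.substr(start)
                                         : path.substr(start, slash - start);
        key += is_all_digits(segment) ? "{int}" : segment;
        if (slash == std::string::npos) break;
        key += '/';
        start = slash + 1;
    }
    return key + rest;
}

// Keep the first URL seen for each key, preserving input order.
std::vector<std::string> dedupe(const std::vector<std::string>& urls) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> kept;
    for (const auto& url : urls) {
        if (seen.insert(dedupe_key(url)).second) kept.push_back(url);
    }
    return kept;
}
```

With this key, somesite.com/product/1 and somesite.com/product/2 collapse into one entry, while a path ending in a non-numeric string keeps its own entry, which matches the "digits only" heuristic proposed in the thread.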
Just added a PR to account for this. It checks for common assets to ignore (images, fonts), as well as integers in URLs. Tested it out and seems to be working well. @larskraemer, if you want to take a look: #9
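For readers who want a picture of the asset check mentioned above, here is a rough sketch. It is not the code from the PR: the function name `is_static_asset` and the extension list are assumptions made for this example, and the real list in the tool may differ.

```cpp
#include <algorithm>
#include <cassert>  // only needed for assert-based checks of the helper
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical helper: true if the URL path ends in a common image or font
// extension. The extension list below is an assumption for illustration.
bool is_static_asset(const std::string& url) {
    static const char* const kAssetExtensions[] = {
        "png", "jpg", "jpeg", "gif", "svg", "ico",
        "woff", "woff2", "ttf", "eot", "otf"};

    // Ignore the query string and fragment when looking for the extension.
    const std::string path = url.substr(0, url.find_first_of("?#"));
    const std::size_t slash = path.find_last_of('/');
    const std::size_t dot = path.find_last_of('.');

    // No dot at all, or the last dot sits before the last slash
    // (for example the dot in the host name): not an asset.
    if (dot == std::string::npos) return false;
    if (slash != std::string::npos && dot < slash) return false;

    // Compare the extension case-insensitively.
    std::string ext = path.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });

    return std::find(std::begin(kAssetExtensions), std::end(kAssetExtensions),
                     ext) != std::end(kAssetExtensions);
}
```

URLs for which this returns true would simply be skipped before the duplicate check runs. Matching on the extension alone is a deliberate simplification: it needs no network request, at the cost of missing assets served from extensionless paths.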
PR merged, feel free to give it a test @simonebovi!
Thanks a lot @ameenmaali, it seems to work much better now. I just ran some tests and I think it can be improved further. For example, these URLs are still kept:
These are basically the same, so I think only one of them should be kept. However, a URL like this should be kept as well:
Not sure if this is feasible though. What do you think?
Thank you, I will check this out shortly. This should already be accounted for, so it may be a bug.
Hey @simonebovi, just tried to test this out, and it seems to be working as expected. Can you try again to verify? Make sure to enable the
Hi,
thanks a lot for this tool, it is very useful!
I was wondering if it would be possible to also implement dedupe functionality for this kind of URL:
This should result in just:
It seems to me that this is not currently taken into consideration.
I would really like to contribute this myself, but my C++ knowledge is really rusty :)
Thanks again!