Eliminate duplicates that are not in query strings #6

Closed
ghost opened this issue Jun 6, 2020 · 8 comments

@ghost

ghost commented Jun 6, 2020

Hi,
thanks a lot for this tool, it is very useful!

I was wondering if it would be possible to also implement deduplication for URLs like these:

  • /product/1/buy/1
  • /product/1/buy/2
  • /product/1/
  • /product/2/

This should result in just:

  • /product/1/buy/1
  • /product/1/

As far as I can tell, this case is not currently taken into consideration.

I would really like to contribute this myself, but my C++ knowledge is really rusty :)

Thanks again!

@larskraemer
Contributor

This functionality would probably need to be added as a switch: if we consider somesite.com/product/1 and somesite.com/product/2 to be the same, we would also need to consider somesite.com/something/login and somesite.com/something/profile to be the same.
Anything else would need some heuristic for determining whether a difference in the path is relevant, which would be complicated, error-prone and (probably) slow.

I'd propose adding a switch that tells the program which parts of the path are irrelevant, like so:

  • -i 2 would make ex.com/product/1/whatever the same as ex.com/product/x/whatever
  • -i 2 -i 3 would make ex.com/product/1/2 the same as ex.com/product/x/y
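
The proposed switch can be sketched roughly as follows, in Python for brevity. This is only an illustration of the idea, not the tool's code: `path_key` and `ignored_positions` are hypothetical names, and `-i` is the flag proposed in this comment rather than an existing option. Positions are 1-based path segments, matching the examples above.

```python
from urllib.parse import urlsplit


def path_key(url, ignored_positions):
    """Build a comparison key with the given 1-based path positions masked out.

    Hypothetical helper: two URLs count as duplicates when their keys are equal.
    """
    # urlsplit only recognises the host when the URL has a scheme or starts with "//".
    parts = urlsplit(url if "://" in url else "//" + url)
    segments = [s for s in parts.path.split("/") if s]
    masked = [
        "*" if index in ignored_positions else segment
        for index, segment in enumerate(segments, start=1)
    ]
    return (parts.netloc, tuple(masked))


# "-i 2" from the proposal corresponds to ignored_positions={2}
same = path_key("ex.com/product/1/whatever", {2}) == path_key("ex.com/product/x/whatever", {2})
```

Note that masking position 2 also makes somesite.com/something/login and somesite.com/something/profile compare equal, which is exactly the trade-off described above.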

@ghost
Author

ghost commented Jun 6, 2020

Yeah, I can understand.

I think it could be feasible if the check is applied only when the differing segment is a number.

1 is likely similar to 2, so one of them could be discarded.

Two different strings probably have different meanings, so they can be kept.
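
A minimal sketch of this digit-only idea, in Python for brevity and under the assumption that a segment counts as a number only if it consists entirely of ASCII digits. The names `numeric_key` and `dedupe` are hypothetical and are not taken from the tool.

```python
def numeric_key(path):
    """Replace every all-digit path segment with a placeholder; keep other segments."""
    segments = [s for s in path.split("/") if s]
    return tuple("<int>" if s.isascii() and s.isdigit() else s for s in segments)


def dedupe(paths):
    """Keep the first path seen for each key, preserving input order."""
    seen = set()
    kept = []
    for path in paths:
        key = numeric_key(path)
        if key not in seen:
            seen.add(key)
            kept.append(path)
    return kept


result = dedupe(["/product/1/buy/1", "/product/1/buy/2", "/product/1/", "/product/2/"])
# result == ["/product/1/buy/1", "/product/1/"]
```

Because only numeric segments are masked, paths that differ in a word (for example buy versus sell) still produce different keys and are both kept.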

@ameenmaali
Owner

Thanks for the suggestion @simonebovi! I wanted to include this in the initial release but didn’t have time to get it done. I have actually already started working on it. I’m focusing on integer differences only, as anything else would be very hard without context and would most likely cause more issues than it solves unless done extremely well.

@ameenmaali
Owner

Just added a PR to account for this. It checks for common assets to ignore (images, fonts), as well as integers in URLs. I tested it out and it seems to be working well. @larskraemer, if you want to take a look: #9
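
The asset check could look something like the sketch below, written in Python for brevity. It is an assumption-laden illustration, not the code from the PR: the extension list and the name `is_static_asset` are hypothetical, since the thread does not say which extensions the tool actually ignores.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical list; the real set of ignored extensions is not given in this thread.
ASSET_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",  # images
    ".woff", ".woff2", ".ttf", ".eot",                # fonts
}


def is_static_asset(url):
    """Return True if the URL path ends in one of the assumed asset extensions."""
    path = urlsplit(url).path  # the query string is excluded from the path
    extension = posixpath.splitext(path)[1].lower()
    return extension in ASSET_EXTENSIONS


urls = ["https://ex.com/img/logo.PNG?v=2", "https://ex.com/product/1/"]
pages = [u for u in urls if not is_static_asset(u)]
```

Matching on the parsed path rather than the raw string keeps a query string such as `?v=2` from hiding the extension.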

@ameenmaali
Owner

PR merged, feel free to give it a test @simonebovi!

@ghost
Author

ghost commented Jun 7, 2020

Thanks a lot @ameenmaali,
I really love open source projects!

It seems to work way better now.

I just ran some tests and found that, in my opinion, this can be improved further.

For example, these URLs are all still kept:

  • product/1/buy/1
  • product/1/buy/2
  • product/1/buy/3

These are basically the same, so I think only one of them should be kept.

However, a URL like this should be kept as well:

  • product/1/sell/1

Not sure if this is feasible though.

What do you think?

@ameenmaali
Owner

Thank you, I will check this out shortly. This should already be accounted for, so it may be a bug.

@ameenmaali
Owner

Hey @simonebovi, I just tried to test this out, and it seems to be working as expected. Can you try again to verify? Make sure to enable the -s flag when running. One known issue is that ports in the URL are not checked, so the same URL may show up twice when the port numbers differ or one is missing. I will add that in a future update (#10).
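
One possible way to handle the port case, sketched in Python for brevity. This rests on an assumption not stated in the thread: that a missing port should be treated as equal to the scheme's default port (80 for http, 443 for https), while any other explicit port stays distinct. The name `host_key` is hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def host_key(url):
    """Return (scheme, hostname, port), filling in the default port when none is given."""
    parts = urlsplit(url)
    port = parts.port  # None when the URL has no explicit port
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


# An explicit default port and a missing port yield the same key.
same_host = host_key("https://ex.com/a") == host_key("https://ex.com:443/a")
```

Under this assumption, http://ex.com:8080/a and http://ex.com/a remain different hosts, since 8080 is not the default for http.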
