-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why are we decoding the URLs before parsing #7
Comments
Hey @larskraemer, thanks for the note. This is a great point. I built this in mind with pulling URLs from multiple different tools, some of which I do know may have encoded or decoded results. However, you make a good point that’s there’s a risk of getting incorrect results here with something like the following example: https://site.com/redirect?url=https://google.com%3Furl=anothersite.com%26code=302&code=200 After thinking about it, I’m not even sure there’s much value in decoding at all. I’m going to check on some of the popular tools out there that generate these URL lists (such as waybackurls & gau) to understand the output and determine if encoding and decoding is even relevant anymore |
Yea, I thought it might have something to do with interoperability. I think in general, we should be handling URLs as they would be typed into the browser, and maybe add a switch to decode first, or even just a seperate decoder tool. I just reworked the parsing a bit in anticipation of #6 and got a pretty significant performance boost from just not decoding at all. (Also, 5 times over regex, yay) example.com/test?%61=b These query strings are absolutely identical to a server, so we should treat them as identical too. Same for unnecessarily encoded parts of the path. |
I just noticed, this decoding business is going to mess up the deduplication too, even if we decode only the parts separately: The second query string decodes to ?a=b&c=d, which will make it appear in the deduped results. |
Decoding & encoding logic was removed, closing out |
Before doing any work in the parser, we are decoding the URL, i.e. replacing "%ab" with '\xab'.
I think it would be better to this after parsing the URL, since the following URL, for example, would produce incorrect results:
https://example.com/test%3Ftest
(Note: 0x3F is ASCII for '?')If this URL is decoded first, then parsed, it will be parsed as "https://example.com/test" with a query string of "test". This is not the behavior any browser is going to give you, and I believe it should not be the behavior of this program.
Instead, I believe we should decode the parts of the URL separately after parsing, probably even while assembling the URL key.
The text was updated successfully, but these errors were encountered: