
Why are we decoding the URLs before parsing #7

Closed
larskraemer opened this issue Jun 6, 2020 · 4 comments

Comments

@larskraemer
Contributor

Before doing any work in the parser, we are decoding the URL, i.e. replacing "%ab" with '\xab'.
I think it would be better to do this after parsing the URL, since the following URL, for example, would produce incorrect results:

https://example.com/test%3Ftest (Note: 0x3F is ASCII for '?')

If this URL is decoded first, then parsed, it will be parsed as "https://example.com/test" with a query string of "test". This is not the behavior any browser is going to give you, and I believe it should not be the behavior of this program.
Instead, I believe we should decode the parts of the URL separately after parsing, probably even while assembling the URL key.

@ameenmaali
Owner

Hey @larskraemer, thanks for the note. This is a great point. I built this with pulling URLs from multiple different tools in mind, some of which I know may have encoded or decoded results. However, you make a good point that there's a risk of getting incorrect results here with something like the following example:

https://site.com/redirect?url=https://google.com%3Furl=anothersite.com%26code=302&code=200

After thinking about it, I'm not even sure there's much value in decoding at all. I'm going to check on some of the popular tools out there that generate these URL lists (such as waybackurls & gau) to understand their output and determine whether encoding and decoding are even relevant anymore.

@larskraemer
Contributor Author

larskraemer commented Jun 6, 2020

Yeah, I thought it might have something to do with interoperability. I think in general, we should be handling URLs as they would be typed into the browser, and maybe add a switch to decode first, or even just a separate decoder tool. I just reworked the parsing a bit in anticipation of #6 and got a pretty significant performance boost from just not decoding at all. (Also, 5 times faster than the regex version, yay)
Decoding might be sensible for these kinds of URLs:

example.com/test?%61=b
example.com/test?a=b

These query strings are absolutely identical to a server, so we should treat them as identical too. Same for unnecessarily encoded parts of the path.

@larskraemer
Contributor Author

I just noticed, this decoding business is going to mess up the deduplication too, even if we decode only the parts separately:
example.com/test?a=b
example.com/test?a=b%26c%3Dd

The second query string decodes to ?a=b&c=d, which will make it appear as a distinct URL in the deduped results.
At this point, we might have to change the way we are eliminating duplicates completely.

@ameenmaali
Owner

Decoding & encoding logic was removed; closing out.
