
Why are we decoding the URLs before parsing #7

Closed
larskraemer opened this issue Jun 6, 2020 · 4 comments

Comments

@larskraemer
Contributor

Before doing any work in the parser, we are decoding the URL, i.e. replacing "%ab" with '\xab'.
I think it would be better to do this after parsing the URL, since the following URL, for example, would produce incorrect results:

https://example.com/test%3Ftest (Note: 0x3F is ASCII for '?')

If this URL is decoded first, then parsed, it will be parsed as "https://example.com/test" with a query string of "test". This is not the behavior any browser is going to give you, and I believe it should not be the behavior of this program.
Instead, I believe we should decode the parts of the URL separately after parsing, probably even while assembling the URL key.

@ameenmaali
Owner

Hey @larskraemer, thanks for the note. This is a great point. I built this with pulling URLs from multiple different tools in mind, some of which I know may have encoded or decoded results. However, you make a good point that there's a risk of getting incorrect results here with something like the following example:

https://site.com/redirect?url=https://google.com%3Furl=anothersite.com%26code=302&code=200

After thinking about it, I'm not even sure there's much value in decoding at all. I'm going to check on some of the popular tools out there that generate these URL lists (such as waybackurls & gau) to understand their output and determine whether encoding and decoding are even relevant anymore.

@larskraemer
Contributor Author

larskraemer commented Jun 6, 2020

Yeah, I thought it might have something to do with interoperability. I think in general, we should be handling URLs as they would be typed into the browser, and maybe add a switch to decode first, or even just a separate decoder tool. I just reworked the parsing a bit in anticipation of #6 and got a pretty significant performance boost from just not decoding at all. (Also, 5 times faster than the regex version, yay)
Decoding might be sensible for these kinds of URLs:

example.com/test?%61=b
example.com/test?a=b

These query strings are absolutely identical to a server, so we should treat them as identical too. Same for unnecessarily encoded parts of the path.

@larskraemer
Contributor Author

I just noticed, this decoding business is going to mess up the deduplication too, even if we decode only the parts separately:
example.com/test?a=b
example.com/test?a=b%26c%3Dd

The second query string decodes to ?a=b&c=d, which will make it appear as a distinct URL in the deduped results.
At this point, we might have to change the way we are eliminating duplicates completely.

@ameenmaali
Owner

Decoding & encoding logic was removed; closing out.
