option to decode percent-encoded URLs #377

Open
pabs3 opened this issue Nov 29, 2024 · 2 comments

Comments

Contributor

pabs3 commented Nov 29, 2024

In my monitoring of ArchiveBot I often have to deal with URLs that consist of percent-encoded junk.

I would like to be able to decode the junk with trurl, since it is a convenient tool for wrangling URLs on the command-line.

There doesn't appear to be a way to get trurl to decode the full URL as percent-encoded data. It only seems to do that when extracting query parameters, or for the JSON output.

Here is an example that I had to deal with recently:

$ echo 'https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5D%0A%0A%5Bimg%5Dhttps:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg' |
  sed 's@^@https://foo.com/?url=@' |
  trurl -f - -g {query:url} | tee urls
https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg img]

[img]https:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg
$ cat urls | sed -E 's/ *\[?img\]? *//g' | trurl -f -
https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg
https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg
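For comparison, here is a minimal Python sketch of the whole-URL decoding that the pipeline above works around. It is not trurl functionality. The helper name `decode_and_split` is hypothetical, and the regular expression is the same as the sed expression above.

```python
import re
from urllib.parse import unquote


def decode_and_split(raw: str) -> list[str]:
    """Percent-decode a whole URL string and split it into candidate URLs.

    Hypothetical helper for illustration only; it does not validate or
    normalize the results the way trurl does.
    """
    decoded = unquote(raw)
    # Drop the "[img]" / "img]" markers, mirroring the sed expression above.
    cleaned = re.sub(r" *\[?img\]? *", "", decoded)
    # Each remaining non-empty line is treated as one candidate URL.
    return [line for line in cleaned.splitlines() if line]


raw = (
    "https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg"
    "%20img%5D%0A%0A%5Bimg%5D"
    "https:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg"
)
for url in decode_and_split(raw):
    print(url)
```

Unlike the final `trurl -f -` step in the shell version, this sketch leaves the second URL's `https:/` prefix as it is, so the results would still need to be passed through trurl to be normalized.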

I propose that the solution to this situation would be one of these two options:

Change the --get option to also URL decode the {url} component, as currently implied by the documentation ("The following component names are available (case sensitive): url, ... Components are shown URL decoded by default."), and require --urlencode to get the non-decoded version.

Update the --get documentation to mention that the {url} component is not URL decoded, then add a --urldecode option to get the URL decoded version of it. This option could also be used without --get.

Contributor

jacobmealey commented Nov 29, 2024

trurl tries to ensure that whenever it outputs a whole URL, that URL is valid. The percent-encoded characters in your URL are not valid when decoded. An example of trurl decoding a URL can be seen at the top of the man page, in the normalization section:

$ trurl 'http://ex%61mple:80/%62ath/a/../b?%2e%FF#tes%74'
http://example/bath/b?.%ff#test

If we try appending a valid percent encoded value to your url, for example %41 (capital 'A'), we get:

$ trurl 'https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5D%0A%0A%5Bimg%5Dhttps:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%41' 
https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5d%0a%0a%5bimg%5dhttps%3a/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpgA

where you can see it's showing the decoded A at the end of the URL.
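The behaviour in these two examples can be approximated in a short Python sketch. This is only a model of the output shown above, not trurl's actual implementation. It assumes that an escape is decoded only when it stands for an RFC 3986 unreserved character, and that every other escape is kept with lowercase hex digits. The function name `decode_safe_escapes` is hypothetical.

```python
import re
import string

# RFC 3986 "unreserved" characters: safe to show decoded anywhere in a URL.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_safe_escapes(text: str) -> str:
    """Decode %XX escapes only when the result is an unreserved character.

    Assumption for illustration: all other escapes are kept and their hex
    digits are lowercased, matching the output shown in this thread.
    """
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).lower()

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, text)


print(decode_safe_escapes("/%62ath/b?%2e%FF#tes%74"))
print(decode_safe_escapes("c.jpg%20img%5D%0A%0A%5Bimg%5Dhttps%41"))
```

The sketch only rewrites escapes that are already present. It does not reproduce other normalization steps visible above, such as removing the default port, resolving `a/../b`, or encoding the literal `:` in the path as `%3a`.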

A good solution for you may be to use the --json option, which gives the following output:

[
  {
    "url": "https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5d%0a%0a%5bimg%5dhttps%3a/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg",
    "parts": {
      "scheme": "https",
      "host": "live.staticflickr.com",
      "path": "/65535/49752865666_d5b24db0ed_c.jpg img]\n\n[img]https:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg"
    }
  }
]

Contributor Author

pabs3 commented Nov 30, 2024

Thanks for the explanation.

In my case I want the invalid characters rather than their percent-encoded equivalent, so that I can more easily process the data as multiple URLs instead of just one.

As mentioned in the initial post, I know about the JSON option, but it is not as convenient as a single option could be, since I then need to write a jq program to join up the scheme, hostname and path as appropriate.
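To show what that joining step involves, here is a rough Python equivalent of such a jq program, run against the JSON output quoted earlier in this thread. The function names are hypothetical, and the sketch assumes that only the scheme, host and path parts are present. A real version would also need to handle user, port, query and fragment.

```python
import json

# The --json output quoted earlier in this thread (raw string, so that the
# "\n" sequences stay JSON escapes until json.loads decodes them).
TRURL_JSON = r'''
[
  {
    "url": "https://live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg%20img%5d%0a%0a%5bimg%5dhttps%3a/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg",
    "parts": {
      "scheme": "https",
      "host": "live.staticflickr.com",
      "path": "/65535/49752865666_d5b24db0ed_c.jpg img]\n\n[img]https:/live.staticflickr.com/65535/49752865666_d5b24db0ed_c.jpg"
    }
  }
]
'''


def join_decoded(entry: dict) -> str:
    """Join the decoded scheme, host and path of one JSON entry."""
    parts = entry["parts"]
    return f'{parts["scheme"]}://{parts["host"]}{parts["path"]}'


def decoded_urls(trurl_json: str) -> list[str]:
    """Return one joined, decoded string per URL in the JSON array."""
    return [join_decoded(entry) for entry in json.loads(trurl_json)]


for decoded in decoded_urls(TRURL_JSON):
    print(decoded)
```

The joined string contains the raw space and newlines, so it can then be split into separate URLs. This is the extra work that a single decoding option would avoid.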

So maybe what I want is a --keep-invalid-characters feature?
