Implement content sniffing for HTML parsing #808

Merged
asciimoo merged 2 commits into gocolly:master from the content-sniffing branch
Mar 27, 2024

Conversation

WGH-
Collaborator

@WGH- WGH- commented Mar 25, 2024

Web pages can be served without Content-Type set, in which case browsers employ content sniffing. Do the same here, in Colly.

While we're at it, change the Content-Type check to something stricter than mere "html" substring match.
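
A minimal sketch of the sniffing fallback described above, assuming the standard library's `http.DetectContentType` is an acceptable stand-in for browser-style sniffing. The helper name `effectiveContentType` is hypothetical and is not taken from the Colly code in this PR.

```go
package main

import (
	"fmt"
	"net/http"
)

// effectiveContentType is a hypothetical helper, not Colly's actual code.
// It returns the declared Content-Type when the server sent one and
// otherwise sniffs the type from the response body.
func effectiveContentType(declared string, body []byte) string {
	if declared != "" {
		return declared
	}
	// DetectContentType considers at most the first 512 bytes of body and
	// falls back to "application/octet-stream" when nothing matches.
	return http.DetectContentType(body)
}

func main() {
	body := []byte("<!DOCTYPE html><html><body>hello</body></html>")

	// No Content-Type header: the type is sniffed from the body.
	fmt.Println(effectiveContentType("", body)) // text/html; charset=utf-8

	// A declared Content-Type is used as-is.
	fmt.Println(effectiveContentType("application/json", body)) // application/json
}
```

In this sketch a declared header always wins, so sniffing only applies to responses that omit Content-Type altogether.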

@WGH- WGH- force-pushed the content-sniffing branch from 69cc94a to 40d3e41 Compare March 25, 2024 21:30
@WGH-
Collaborator Author

WGH- commented Mar 25, 2024

Welp, strings.Cut appeared only in Go 1.18. Instead of rewriting it the old way I decided to drop old Go versions (#810).
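
For reference, "the old way" on Go versions before 1.18 would look roughly like the sketch below, built on `strings.Index`. The name `cutCompat` is hypothetical; this PR does not include such a helper, since the old Go versions were dropped instead.

```go
package main

import (
	"fmt"
	"strings"
)

// cutCompat is a hypothetical pre-Go-1.18 replacement for strings.Cut.
// It splits s around the first occurrence of sep and reports whether
// sep was found at all.
func cutCompat(s, sep string) (before, after string, found bool) {
	if i := strings.Index(s, sep); i >= 0 {
		return s[:i], s[i+len(sep):], true
	}
	return s, "", false
}

func main() {
	before, after, found := cutCompat("text/html; charset=utf-8", ";")
	fmt.Printf("%q %q %v\n", before, after, found) // "text/html" " charset=utf-8" true
}
```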

WGH- added 2 commits March 27, 2024 17:57
Instead of looking for an "html" substring, actually parse the MIME type
string. Don't use mime.ParseMediaType though, as it doesn't handle
invalid duplicate parameters (e.g. "text/html; charset=UTF-8; charset=utf-8")
that occur in the wild.

Web pages can be served without Content-Type set, in which case
browsers employ content sniffing. Do the same here, in Colly.
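
The stricter check from the first commit message could be sketched as follows. This is an illustration under assumptions, not the merged code: the helper name `isHTMLContentType` is hypothetical, and the set of media types treated as HTML (`text/html`, `application/xhtml+xml`) is a guess. It needs Go 1.18 or newer for `strings.Cut`.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// isHTMLContentType is a hypothetical helper, not the code merged in this PR.
// It compares only the media type before the first ';', so malformed or
// duplicated parameters after it cannot make the check fail.
func isHTMLContentType(contentType string) bool {
	mediaType, _, _ := strings.Cut(contentType, ";")
	mediaType = strings.ToLower(strings.TrimSpace(mediaType))
	// Assumption: these two media types are the ones treated as HTML.
	return mediaType == "text/html" || mediaType == "application/xhtml+xml"
}

func main() {
	const dup = "text/html; charset=UTF-8; charset=utf-8"

	// mime.ParseMediaType reports an error for the duplicated parameter.
	_, _, err := mime.ParseMediaType(dup)
	fmt.Println(err != nil) // true

	// The lenient check still recognises the value as HTML.
	fmt.Println(isHTMLContentType(dup)) // true

	// A plain "html" substring match would wrongly accept this value.
	fmt.Println(isHTMLContentType("application/json; note=html")) // false
}
```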
@WGH- WGH- force-pushed the content-sniffing branch from 40d3e41 to bad50ff Compare March 27, 2024 14:57
@WGH- WGH- marked this pull request as ready for review March 27, 2024 15:02
@WGH- WGH- requested a review from asciimoo March 27, 2024 15:07
@asciimoo asciimoo merged commit 5224b97 into gocolly:master Mar 27, 2024
9 checks passed