Implement content sniffing for HTML parsing #808

WGH- · 2024-03-25T18:50:08Z

Web pages can be served without Content-Type set, in which case browsers employ content sniffing. Do the same here, in Colly.

While we're at it, change the Content-Type check to something stricter than a mere "html" substring match.
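As a rough illustration of the idea (not the code from this PR), the fallback could look like the sketch below. It assumes the standard library's `http.DetectContentType` is an acceptable sniffer. The helper name `looksLikeHTML` and its signature are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// looksLikeHTML is a hypothetical helper, not Colly's API.
// It uses the declared Content-Type when one is present and
// otherwise sniffs the body, roughly as a browser would.
func looksLikeHTML(contentType string, body []byte) bool {
	if contentType == "" {
		// DetectContentType considers at most the first 512 bytes
		// and always returns a valid MIME type.
		contentType = http.DetectContentType(body)
	}
	// Compare only the media type, ignoring any parameters.
	mediaType, _, _ := strings.Cut(contentType, ";")
	mediaType = strings.ToLower(strings.TrimSpace(mediaType))
	return mediaType == "text/html"
}

func main() {
	page := []byte("<!DOCTYPE html><html><body>hi</body></html>")

	fmt.Println(looksLikeHTML("", page))                 // true: sniffed as HTML
	fmt.Println(looksLikeHTML("", []byte("plain text"))) // false: sniffed as text/plain
	fmt.Println(looksLikeHTML("application/json", page)) // false: declared type wins
}
```

Note that a declared Content-Type takes precedence in this sketch; sniffing only runs when the header is missing.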

WGH- · 2024-03-25T22:12:18Z

Welp, strings.Cut was only added in Go 1.18. Instead of rewriting it the old way, I decided to drop support for old Go versions (#810).
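For context, "the old way" would presumably be something along the lines of the sketch below, built on `strings.Index`. The name `cutCompat` is hypothetical; it is only meant to show what a pre-1.18 replacement for `strings.Cut` might look like.

```go
package main

import (
	"fmt"
	"strings"
)

// cutCompat is a hypothetical stand-in that follows the same contract
// as strings.Cut: it splits s around the first instance of sep.
func cutCompat(s, sep string) (before, after string, found bool) {
	if i := strings.Index(s, sep); i >= 0 {
		return s[:i], s[i+len(sep):], true
	}
	// Separator not present: everything is "before".
	return s, "", false
}

func main() {
	before, after, found := cutCompat("text/html; charset=utf-8", ";")
	fmt.Printf("%q %q %v\n", before, after, found)
	// prints: "text/html" " charset=utf-8" true
}
```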

WGH- force-pushed the content-sniffing branch from 69cc94a to 40d3e41 Compare March 25, 2024 21:30

WGH- added 2 commits March 27, 2024 17:57

Improve Content-Type parsing

31f0876

Instead of looking for an "html" substring, actually parse the MIME type string. Don't use mime.ParseMediaType, though, as it rejects invalid duplicate parameters (e.g. "text/html; charset=UTF-8; charset=utf-8") that occur in the wild.
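A minimal sketch of the approach described in this commit message, assuming that only the media type (and not its parameters) matters for the check. `mediaTypeOf` is a hypothetical helper name, not necessarily what the commit uses.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// mediaTypeOf is a hypothetical helper that extracts the lowercase
// media type from a Content-Type value without parsing parameters,
// so malformed or duplicate parameters cannot cause a failure.
func mediaTypeOf(contentType string) string {
	mediaType, _, _ := strings.Cut(contentType, ";")
	return strings.ToLower(strings.TrimSpace(mediaType))
}

func main() {
	const header = "text/html; charset=UTF-8; charset=utf-8"

	// mime.ParseMediaType returns an error for the duplicate parameter.
	_, _, err := mime.ParseMediaType(header)
	fmt.Println(err != nil) // true

	// Cutting at the first ';' still yields the media type.
	fmt.Println(mediaTypeOf(header)) // text/html

	// A substring match on "html" would accept this; an exact
	// comparison against "text/html" does not.
	fmt.Println(mediaTypeOf("application/x-html-template") == "text/html") // false
}
```

The trade-off is that parameters such as charset are ignored entirely, which is fine when the only question is whether the response should be treated as HTML.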

Implement content sniffing for HTML parsing

bad50ff

Web pages can be served without Content-Type set, in which case browsers employ content sniffing. Do the same here, in Colly.

WGH- force-pushed the content-sniffing branch from 40d3e41 to bad50ff Compare March 27, 2024 14:57

WGH- marked this pull request as ready for review March 27, 2024 15:02

WGH- requested a review from asciimoo March 27, 2024 15:07

asciimoo merged commit 5224b97 into gocolly:master Mar 27, 2024
9 checks passed
