provide an easy way to validate an entire document and not only its body #2044

robstoll · 2023-11-09T10:08:44Z

consider the following:

Document document = Parser.htmlParser().parseInput("<html><head><script>alert('xss');</script></head><body>all good</body></html>", "");
Cleaner(Safelist.none().addTags("title")).isValid(document);

returns false but not because we added script to head but because there is this check in isValid:

&& dirtyDocument.head().childNodes().isEmpty(); // because we only look at the body, but we start from a shell, make sure there's nothing in the head

I would have expected isValid analyses everything as it a) takes a Document and b) is not named isValidBodyHtml in contrast to the other available method.
Since CleaningVisitor is private, I cannot pass html element instead of the body. Suggestions how I can achieve what I want?

(I suggest you deprecate isValid and add another overload isValidBodyHtml which takes a Document)

The text was updated successfully, but these errors were encountered:

jhy · 2023-11-10T01:32:06Z

Yes, the Cleaner only takes content from the body. Per the class documentation:

It is assumed that the input HTML is a body fragment; the clean methods only pull from the source's body, and the
canned safe-lists only allow body contained tags.

I'm not sure what you mean by

returns false but not because we added script to head

It is returning false because there is content in the head.

(I suggest you deprecate isValid and add another overload isValidBodyHtml which takes a Document)

But in that case the name would imply it is vetting the HTML which it would not be. The isValidBodyHtml checks both that there are not parser errors and that nothing was removed by the Cleaner. The isValid(Document) can only test that the Cleaner has not removed anything.

Taking a step back, can you tell me more about what you're trying to achieve, what your input HTML is, and what you consider valid / invalid?

In a world where the Cleaner does support full documents (which I am happy to consider supporting at some point), it seems that your configured Cleaner (just supporting "title") tags would always be invalid, as it will always contain html, head, body.

jhy · 2023-11-10T01:33:55Z

Ref also #1520

jhy · 2023-11-10T01:51:17Z

(BTW, thank you for the feedback!)

Maybe an approach would be that the Cleaner checks what tags are in the safelist. And if it sees {html | head | body}, act on the whole document, and if not, act as now in body fragment mode.

That way there's not another method or configurator, and there's no deprecation. Existing implementations keep the same behavior (body fragment) as adding non-body tags doesn't work currently.

robstoll · 2023-11-10T08:49:14Z

Sorry, should have wrote down the way I came here which was kind of via three corners. Because, in fact, I don't need whole document validation for my current use case I was only quite confused just looking at the method isValid(Document) to find out it only checks the body.

how I came here:

I need to parse HTML to process it further
I still want to check that it is valid in terms of used nodes, attributes
I don't want to parse twice => I create the Parser by my own instead via Jsoup...
parser.parseInput(" will be in body") => first figured Safelist.none() does not report on comments correctly but it does as long as it is in the body
now realised that Cleaner.isValid(Document) only validates the body

IMO we can close this issue. Maybe we should create a new one which improves the check in Cleaner.isValid
so that it also catches if comments are define before the domlike:

// because we only look at the body, but we start from a shell, make sure there's nothing in the head...
&& dirtyDocument.head().childNodes().isEmpty()
// and neither before the HTML 
&& dirtyDocument.childNodes().size == 1

We might want to allow DocumentType before HTML? But maybe it would be better to adjust parseInput so that the comment lands in the body instead of before the <html> tag?

In any case, I use now parseFragmentInput instead and no longer suffer this problem.
I hope this gives more insight why I created the issue and somewhat sorry for the noise, feel free to close the issue.

jhy · 2023-11-15T23:48:03Z

Thanks for the detail -- makes sense and helps understand the requirement.

I do want to overhaul the Cleaner interface to enable optionally cleaning whole documents vs body fragments. Thinking that might simply be enabled by adding head elements to the safelist - if those are present, flip to document vs body fragment mode. And also adding more control of what happens to dropped elements - unwrapped, removed, escaped, etc. (See #2050).

Will leave this as a tracker bug for now while planning the revision.

jhy · 2023-11-16T00:03:12Z

I think another requirement here is that in one parse you can get both the cleaned Document and also the isValid response (the combination of if the parser saw any errors, and if any elements / attributes were dropped during cleaning).

Another ask I've seen is for a report of what elements / attributes were dropped.

Thinking then that there's a more detailed response object required from the cleaner that can provide that level of detail along with the response document.

This comment was marked as off-topic.

Sign in to view

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

provide an easy way to validate an entire document and not only its body #2044

provide an easy way to validate an entire document and not only its body #2044

robstoll commented Nov 9, 2023 •

edited

Loading

jhy commented Nov 10, 2023

jhy commented Nov 10, 2023

jhy commented Nov 10, 2023

robstoll commented Nov 10, 2023 •

edited by jhy

Loading

jhy commented Nov 15, 2023

jhy commented Nov 16, 2023

This comment was marked as off-topic.

provide an easy way to validate an entire document and not only its body #2044

provide an easy way to validate an entire document and not only its body #2044

Comments

robstoll commented Nov 9, 2023 • edited Loading

jhy commented Nov 10, 2023

jhy commented Nov 10, 2023

jhy commented Nov 10, 2023

robstoll commented Nov 10, 2023 • edited by jhy Loading

jhy commented Nov 15, 2023

jhy commented Nov 16, 2023

This comment was marked as off-topic.

robstoll commented Nov 9, 2023 •

edited

Loading

robstoll commented Nov 10, 2023 •

edited by jhy

Loading