-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
provide an easy way to validate an entire document and not only its body #2044
Comments
Yes, the Cleaner only takes content from the body. Per the class documentation:
I'm not sure what you mean by
It is returning false because there is content in the head.
But in that case the name would imply it is vetting the HTML which it would not be. The Taking a step back, can you tell me more about what you're trying to achieve, what your input HTML is, and what you consider valid / invalid? In a world where the Cleaner does support full documents (which I am happy to consider supporting at some point), it seems that your configured Cleaner (just supporting "title") tags would always be invalid, as it will always contain html, head, body. |
Ref also #1520 |
(BTW, thank you for the feedback!) Maybe an approach would be that the Cleaner checks what tags are in the safelist. And if it sees That way there's not another method or configurator, and there's no deprecation. Existing implementations keep the same behavior (body fragment) as adding non-body tags doesn't work currently. |
Sorry, should have wrote down the way I came here which was kind of via three corners. Because, in fact, I don't need whole document validation for my current use case I was only quite confused just looking at the method how I came here:
IMO we can close this issue. Maybe we should create a new one which improves the check in Cleaner.isValid
We might want to allow In any case, I use now |
Thanks for the detail -- makes sense and helps understand the requirement. I do want to overhaul the Cleaner interface to enable optionally cleaning whole documents vs body fragments. Thinking that might simply be enabled by adding head elements to the safelist - if those are present, flip to document vs body fragment mode. And also adding more control of what happens to dropped elements - unwrapped, removed, escaped, etc. (See #2050). Will leave this as a tracker bug for now while planning the revision. |
I think another requirement here is that in one parse you can get both the cleaned Document and also the isValid response (the combination of if the parser saw any errors, and if any elements / attributes were dropped during cleaning). Another ask I've seen is for a report of what elements / attributes were dropped. Thinking then that there's a more detailed response object required from the cleaner that can provide that level of detail along with the response document. |
consider the following:
returns false but not because we added
script
to head but because there is this check inisValid
:I would have expected
isValid
analyses everything as it a) takes aDocument
and b) is not namedisValidBodyHtml
in contrast to the other available method.Since
CleaningVisitor
is private, I cannot passhtml
element instead of the body. Suggestions how I can achieve what I want?(I suggest you deprecate
isValid
and add another overloadisValidBodyHtml
which takes aDocument
)The text was updated successfully, but these errors were encountered: