Not able to identify escaped/unescaped html entity in the text nodes #2206

Open
Muthukirthan opened this issue Oct 6, 2024 · 4 comments
Labels
needs-more-info More information is needed from the reporter to progress the issue

Comments


I am not able to tell whether the input document had & or &amp; in a text node, since Jsoup escapes the character in text nodes. The same goes for other entities, such as < versus &lt;.

This leaves Jsoup users without any way to act on what the input actually contained. For example, we may want to remove a literal < character from a text node but preserve it when it is given as the entity &lt;.

Note: Please let me know if there is already a way to differentiate this.

Providing an option to tell Jsoup not to modify text nodes would be super helpful. It would give users more flexibility and control.


@Muthukirthan Muthukirthan changed the title Jsoup - Not able to identify escaped/unescaped html entity in the text nodes 💭 💭 Jsoup - Not able to identify escaped/unescaped html entity in the text nodes Oct 6, 2024
Author

Muthukirthan commented Oct 6, 2024

I tried different methods in TextNode to get the original input text content, but none of them worked.

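Example:
Input: <p> actual_lt: < || escaped_lt: &lt; </p>

// Parse the input above, then print what each TextNode accessor returns.
Document doc = Jsoup.parse("<p> actual_lt: < || escaped_lt: &lt; </p>");
for (TextNode textNode : doc.selectFirst("p").textNodes()) {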
    System.out.println("textNode.toString():-" + textNode.toString());
    System.out.println("textNode.text():-" + textNode.text());
    System.out.println("textNode.getWholeText():-" + textNode.getWholeText());
    System.out.println("textNode.outerHtml():-" + textNode.outerHtml());
}

Expected (from at least one of these methods): actual_lt: < || escaped_lt: &lt;
Output:

textNode.toString():-  actual_lt: &lt;   ||   escaped_lt: &lt;  
textNode.text():- actual_lt: < || escaped_lt: < 
textNode.getWholeText():-  actual_lt: <   ||   escaped_lt: <  
textNode.outerHtml():-  actual_lt: &lt;   ||   escaped_lt: &lt; 
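
One possible workaround for the output above, offered as a minimal sketch and not as a jsoup feature: double-escape every ampersand in the raw input before handing it to the parser. A single round of entity decoding should then give back the original source characters, so a literal < and an escaped &lt; stay distinguishable in the text node. The class and method names (EntityPreserver, protectEntities) are hypothetical, and the jsoup call appears only in a comment so that the block compiles with the standard library alone.

```java
public final class EntityPreserver {
    private EntityPreserver() {}

    /**
     * Double-escapes every ampersand so that one round of entity decoding
     * by an HTML parser yields the original source characters again.
     */
    public static String protectEntities(String rawHtml) {
        return rawHtml.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String input = "<p> actual_lt: < || escaped_lt: &lt; </p>";
        // Prints: <p> actual_lt: < || escaped_lt: &amp;lt; </p>
        System.out.println(protectEntities(input));
        // Assumption: with jsoup on the classpath, the protected string could
        // then be parsed and inspected, for example:
        //   Jsoup.parse(protectEntities(input)).selectFirst("p").textNodes()
        // reading getWholeText() on each node.
    }
}
```

The trade-off is that the extra level of escaping applies everywhere, not only to text nodes: attribute values such as URLs containing &amp; are affected, script and style contents (which are not entity-decoded) would be altered, and serialized output from outerHtml() or toString() would carry the additional escaping. This approach therefore only suits read-only inspection of text nodes.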

Owner

jhy commented Nov 21, 2024

Can you explain the value here of this suggestion? What's a real example where this would be helpful?

@jhy jhy changed the title 💭 Jsoup - Not able to identify escaped/unescaped html entity in the text nodes Not able to identify escaped/unescaped html entity in the text nodes Nov 21, 2024
Author

Muthukirthan commented Nov 21, 2024

With this bug, we cannot tell whether &nbsphello or &amp;nbsphello was present in the input. For this and any similar bugs, a method such as textNode.getRawText(), which returns the text node exactly as it appeared in the input, would let users apply a temporary fix.

Once parsed by Jsoup, &#13; in the input is converted to a CR character and &#61; is converted to an equals sign.
However, I would prefer that HTML entities not be unescaped during parsing. I even tried the xhtml escape mode, which is expected to leave HTML entities unchanged (except lt, gt, amp, and quot), but the result is always the same as with the base escape mode. I tried several versions and got the same result. I am not sure whether this is a bug.

Owner

jhy commented Nov 22, 2024

Sure, but I am not clear on what you are actually trying to achieve. What feature are you trying to build that would utilize functionality like this?

jsoup parses HTML. That's fundamentally what it does. If we didn't decode, we wouldn't be parsing.

Escape modes control the output, not the input. Input HTML always decodes using the full set.
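
The distinction between decoding on input and escaping on output can be illustrated with a toy model. This is a sketch under stated assumptions, not jsoup's tokenizer: it handles only four entities, and the names (DecodeThenEncodeDemo, decode, encode) are hypothetical. It shows that once the input is decoded, a literal < and an escaped &lt; become the same character, so no output setting can recover which one the source contained.

```java
public final class DecodeThenEncodeDemo {
    private DecodeThenEncodeDemo() {}

    /** Input side of the toy model: turn entity references into characters. */
    public static String decode(String source) {
        // &amp; is decoded last so that "&amp;lt;" becomes "&lt;" and is not
        // decoded a second time into "<".
        return source
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    /** Output side of the toy model: escape characters when serializing. */
    public static String encode(String text) {
        // & is escaped first so the ampersands added by the later
        // replacements are not escaped again.
        return text
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String source = "actual_lt: < || escaped_lt: &lt;";
        String parsed = decode(source);
        System.out.println(parsed);         // actual_lt: < || escaped_lt: <
        System.out.println(encode(parsed)); // actual_lt: &lt; || escaped_lt: &lt;
    }
}
```

In this model, changing how encode behaves (the analogue of an escape mode) alters only the serialized form. The decoded text in between is identical for both spellings of the input, which matches the text() and outerHtml() results reported earlier in the thread.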

@jhy jhy added the needs-more-info More information is needed from the reporter to progress the issue label Nov 22, 2024