Not able to identify escaped/unescaped html entity in the text nodes #2206

Muthukirthan · 2024-10-06T06:59:50Z

I am not able to identify whether the input document has & or &amp; in a text node, since Jsoup escapes the character in the text node. The same goes for other entities, such as < versus &lt;.

This gives Jsoup users no way to act on what the input actually contained. For example, we may want to remove a literal < character in a text node but preserve it when it is given as the entity &lt;.

Note: Please let me know if there is already a way to differentiate this.

An option that tells Jsoup not to modify text nodes would be very helpful. It would give users more flexibility and control.

@jhy

Muthukirthan · 2024-10-06T09:13:01Z

I tried different methods in TextNode to get the original input text content, but none of them worked.

Example:
Input: <p> actual_lt: < || escaped_lt: &lt; </p>

for (TextNode textNode : doc.selectFirst("p").textNodes()) {
    System.out.println("textNode.toString():-" + textNode.toString());
    System.out.println("textNode.text():-" + textNode.text());
    System.out.println("textNode.getWholeText():-" + textNode.getWholeText());
    System.out.println("textNode.outerHtml():-" + textNode.outerHtml());
}

Expected (from any one of the methods): actual_lt: < || escaped_lt: &lt;
Output:

textNode.toString():-  actual_lt: &lt;   ||   escaped_lt: &lt;  
textNode.text():- actual_lt: < || escaped_lt: < 
textNode.getWholeText():-  actual_lt: <   ||   escaped_lt: <  
textNode.outerHtml():-  actual_lt: &lt;   ||   escaped_lt: &lt;

@jhy

jhy · 2024-11-21T04:26:28Z

Can you explain the value here of this suggestion? What's a real example where this would be helpful?

Muthukirthan · 2024-11-21T06:42:47Z

With this bug we cannot tell whether &nbsphello or &amp;nbsphello was present in the input. For any other bug of this kind, a method like textNode.getRawText() that returns the text node exactly as it appeared in the input would let users apply a temporary fix.

A character reference for CR (for example &#13;) in the input will be changed to the CR character, and a reference for the equals sign (for example &equals;) will be changed to =, once they are parsed by Jsoup.
However, I would prefer that HTML entities are not unescaped when they are parsed. I even tried the xhtml escape mode, which I expected to leave all HTML entities unchanged (except lt, gt, amp, and quot), but the result is always the same as with the base escape mode. I tried different versions and none of them worked. I am not sure whether this is a bug.

jhy · 2024-11-22T04:06:29Z

Sure, but I am not clear on what you are actually trying to achieve. What feature are you trying to build that would utilize functionality like this?

jsoup parses HTML. That's fundamentally what it does. If we didn't decode, we wouldn't be parsing.

Escape modes control the output, not the input. Input HTML always decodes using the full set.
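
The point above (decoding happens at parse time, escaping at output time) can be illustrated with a toy model. This is only a sketch and not jsoup's implementation: the class EntityRoundTrip, its three-entry entity table, and the methods decode, escape, and protect are all hypothetical names invented for this example. It uses the input from earlier in the thread to show why both forms look identical after parsing, and one possible workaround: double-escaping ampersands before parsing so that the entity survives as literal text.

```java
import java.util.Map;

// Toy model of the decode/escape round trip; NOT jsoup's implementation.
final class EntityRoundTrip {
    // Hypothetical three-entry table; a real HTML parser knows the full named set.
    private static final Map<String, String> ENTITIES =
            Map.of("&lt;", "<", "&gt;", ">", "&amp;", "&");

    // Parse step: each known character reference becomes the character it names.
    public static String decode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            boolean matched = false;
            if (raw.charAt(i) == '&') {
                for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
                    if (raw.startsWith(e.getKey(), i)) {
                        out.append(e.getValue());
                        i += e.getKey().length();
                        matched = true;
                        break;
                    }
                }
            }
            if (!matched) {
                out.append(raw.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Output step: characters that are unsafe in a text node are escaped again.
    public static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Hypothetical workaround: double-escape ampersands BEFORE parsing, so that
    // decoding yields the entity as literal text instead of the character.
    public static String protect(String raw) {
        return raw.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String raw = " actual_lt: < || escaped_lt: &lt; ";
        System.out.println(decode(raw));          // both forms are now a plain '<'
        System.out.println(escape(decode(raw)));  // both forms are now "&lt;"
        System.out.println(decode(protect(raw))); // the original distinction survives
    }
}
```

If a workaround along these lines were applied to real HTML before handing it to a parser, it would also rewrite ampersands inside attribute values and inside script or style content, so it is probably only suitable when the input is known to be simple text markup.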

Muthukirthan mentioned this issue Oct 6, 2024

Non-terminated HTML Entities are not recognized properly #2207

Closed

Muthukirthan changed the title ~~Jsoup - Not able to identify escaped/unescaped html entity in the text nodes 💭~~ 💭 Jsoup - Not able to identify escaped/unescaped html entity in the text nodes Oct 6, 2024

Muthukirthan mentioned this issue Oct 27, 2024

EscapeMode.none #411

Closed

jhy changed the title ~~💭 Jsoup - Not able to identify escaped/unescaped html entity in the text nodes~~ Not able to identify escaped/unescaped html entity in the text nodes Nov 21, 2024

jhy added the needs-more-info More information is needed from the reporter to progress the issue label Nov 22, 2024
