Rich text truncation and excerpts #188

tobyzerner · 2022-02-03T03:23:05Z

I'm trying to show an "excerpt" of some TextFormatter-formatted content, and I'm wondering if this is something that could be done in the library itself.

My initial approach has been to perform truncation on the rendered HTML. The best library I have found to do this is mrcgrtz/php-shorten. This library has some bugs, and in general, trying to safely truncate HTML without loading up a full DOM parser seems a bit fragile.

It occurred to me that when TextFormatter renders parsed content, it's already going through it with an XML parser (as I understand it). What if I could tell the library: "stop rendering after X text characters (preferably word-safe), append ..., and close all open tags"?

Is there a way to do this? What do you think would be the best approach?

JoshyPHP · 2022-02-03T15:31:26Z

Maintainer

I have occasionally thought about excerpts and truncation, and my soft conclusion is that there's no benefit in trying to implement it in the library. The renderers can't easily keep track of how much content has been output so they wouldn't be able to stop rendering at the cutoff point to save CPU time. They don't keep track of what elements are open either.

If I had to truncate HTML, the first thing I'd try, as a reference and benchmark, is to load the mangled portion of the HTML into DOM and let it fix the markup.

$string = 'Hello world <a href="#" rel="nofollow">link</a>';

$html = substr($string, 0, 14);

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED);

die($dom->saveHTML());

Obviously this isn't word-safe, and you'd be measuring the length of the full markup rather than the actual text displayed, so it's not really useful.
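For comparison, here is a minimal sketch of what measuring the displayed text instead of the markup could look like with ext/dom. It is not something TextFormatter provides: `truncate_by_text()`, its parameters and the wrapper `<div>` are hypothetical names made up for this illustration, and it assumes ASCII input since it counts bytes with `strlen()`.

```php
<?php
// Hypothetical helper, not part of any library: truncate by visible text length.
// Assumes ASCII input, lengths are counted in bytes.
function truncate_by_text(string $html, int $limit, string $ellipsis = '...'): string
{
	$dom = new DOMDocument;
	// The wrapper div guarantees a single root element
	$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED);
	$root = $dom->documentElement;

	$remaining = $limit;
	foreach ((new DOMXPath($dom))->query('//text()') as $node)
	{
		$value = $node->nodeValue;
		if (strlen($value) <= $remaining)
		{
			// This text node fits entirely in the budget
			$remaining -= strlen($value);
			continue;
		}

		$text = substr($value, 0, $remaining);
		if ($value[$remaining] !== ' ')
		{
			// The cut lands mid-word: back up to the last space, or drop the fragment if there is none
			$pos  = strrpos($text, ' ');
			$text = ($pos === false) ? '' : substr($text, 0, $pos);
		}
		$node->nodeValue = rtrim($text) . $ellipsis;

		// Remove everything after the cut point, at every level up to the wrapper
		for ($cur = $node; !$cur->isSameNode($root); $cur = $cur->parentNode)
		{
			while ($cur->nextSibling)
			{
				$cur->parentNode->removeChild($cur->nextSibling);
			}
		}
		break;
	}

	// Serialize the wrapper's children, leaving the wrapper itself out
	$out = '';
	foreach ($root->childNodes as $child)
	{
		$out .= $dom->saveHTML($child);
	}

	return $out;
}

$short = truncate_by_text('Hello world <a href="#" rel="nofollow">link</a>', 8);
$mid   = truncate_by_text('Hello world <a href="#" rel="nofollow">link</a>', 14);
$full  = truncate_by_text('Hello world <a href="#" rel="nofollow">link</a>', 100);
```

Because the elements stay in the tree, the serializer closes them on its own. The cost is a full parse of the input, and a word that spans several elements (such as `foo<b>bar</b>`) is not treated as one word by this sketch.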

I subscribed to mrcgrtz/php-shorten#16 because that's an interesting problem and I'm curious about it.

1 reply

tobyzerner Feb 4, 2022
Author

Thanks for your thoughts @JoshyPHP. It certainly is an interesting problem! I just found this CakePHP util as well which is proving to be a bit more reliable.

JoshyPHP · 2022-02-04T17:28:12Z

Maintainer

Yeah, I saw a few other libraries that appear to share some DNA with mrcgrtz/php-shorten. The CakePHP util is a different codebase but it seems to work more or less the same way. Unless you know exactly the kind of HTML you're working with, I don't think it's realistic to guarantee the HTML's well-formedness in PHP; it's just too complicated to really track what elements need closing because of optional or omissible end tags. If you use PHP's ext/dom, you get a more robust implementation of the HTML specs. The HTML 4.0 specs, that is. If you want a library that supports a version of HTML that's not of voting age, there are userland implementations, but at this point you're running something like 20-40 KLOC to truncate a 400-byte string and you're wondering if you're going a tad overboard. And by you, I mean me.
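To illustrate the point about omissible end tags, here is a small sketch that feeds ext/dom two paragraphs whose `</p>` tags were left out and lets it serialize the result. The sample markup and the `$fixed` variable are made up for the example, and the exact serialization may vary between libxml versions.

```php
<?php
// Two paragraphs with their optional end tags omitted
$html = '<div><p>one<p>two</div>';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED);

// The parser closes the first paragraph when the second one starts,
// and the serializer writes an explicit end tag for both
$fixed = $dom->saveHTML();

echo $fixed;
```

A truncation routine that tracks open elements with its own stack would need to know these rules for every element, which is the part that gets complicated.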

Anyway, for some reason I can't stop code-golfing about doing most of the heavy lifting with PCRE and I came up with the following, which I'm leaving here in case somebody ends up finding this thread from a search engine:

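function totally_legit_truncate(string $html, int $len, bool $exact = false)
{
	// When $exact is true, the cut must land on a word boundary (\b)
	$boundary = ($exact) ? '\\b' : '';
	// Consume up to $len units, each unit being any number of tags followed by one entity or one character
	$regexp   = '(^(?:(?:<[^>]++>)*+(?:&#?[a-z0-9]++;|.)){0,' . $len . '}' . $boundary . ')is';
	if (preg_match('([\\x80-\\xFF])', $html))
	{
		// Non-ASCII bytes present: use Unicode mode so that . consumes whole characters
		$regexp .= 'u';
	}

	// Bail out if the regexp failed or matched nothing, loadHTML() rejects an empty string
	if (!preg_match($regexp, $html, $m) || $m[0] === '')
	{
		return '';
	}
	$dom = new DOMDocument;
	// Let DOM close whatever elements were left open by the cut
	$dom->loadHTML($m[0], LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED);

	return $dom->saveHTML();
}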

$html = 'Hello world <a href="#" rel="nofollow">link</a>';
$html = totally_legit_truncate($html, 14);

var_dump($html);

You can quickly go through HTML with PCRE (you can even skip entire spans like <script.*?</script>) as long as you don't care about tracking element names or even knowing how many characters are actually in the text. One caveat is that PCRE won't be able to compile the regexp when the number of repetitions gets too high: around 750 in Unicode mode, and probably around 2000 otherwise.
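As a rough sketch of those two remarks, the variant below skips whole `<script>` spans and works around the repetition limit by consuming the string in chunks anchored with `\G`. The name `find_cut_offset()`, its `$chunk` parameter and the chunk size of 250 are assumptions made for this illustration, not tested limits, and the pattern runs in byte mode, so it assumes ASCII input.

```php
<?php
// Hypothetical helper: return the byte offset at which to cut $html after $len characters of text.
function find_cut_offset(string $html, int $len, int $chunk = 250): int
{
	// One unit: any number of tags or whole <script> spans, followed by one entity or one character
	$unit   = '(?:<script\\b.*?</script>|<[^>]++>)*+(?:&#?[a-z0-9]++;|.)';
	$offset = 0;
	while ($len > 0)
	{
		// Keep the repetition count small enough for PCRE to compile the regexp
		$n      = min($len, $chunk);
		$regexp = '(\\G(?:' . $unit . '){0,' . $n . '})is';

		// \G anchors the match at $offset; stop if nothing more can be consumed
		if (!preg_match($regexp, $html, $m, 0, $offset) || $m[0] === '')
		{
			break;
		}
		$offset += strlen($m[0]);
		$len    -= $n;
	}

	return $offset;
}

$html = 'Hello world <a href="#" rel="nofollow">link</a>';
$cut  = substr($html, 0, find_cut_offset($html, 14));

$js    = 'a<script>x < y</script>bc';
$jsCut = substr($js, 0, find_cut_offset($js, 2));
```

The returned prefix still has to go through `DOMDocument::loadHTML()` or similar to close the elements left open by the cut.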
