Rich text truncation and excerpts #188
I have occasionally thought about excerpts and truncation, and my soft conclusion is that there's no benefit in trying to implement it in the library. The renderers can't easily keep track of how much content has been output, so they wouldn't be able to stop rendering at the cutoff point to save CPU time. They don't keep track of which elements are open either.

If I had to truncate HTML, the first thing I'd try, as a reference and benchmark, is to load the mangled portion of the HTML into DOM and let it repair the markup:

    $string = 'Hello world <a href="#" rel="nofollow">link</a>';
    // Cut the raw markup at 14 bytes, which lands in the middle of the <a> tag
    $html = substr($string, 0, 14);
    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED);
    die($dom->saveHTML());

Obviously this isn't word-safe, and you'd be measuring the length of the full markup rather than the text actually displayed, so it's not really useful. I subscribed to mrcgrtz/php-shorten#16 because that's an interesting problem and I'm curious about it.
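To illustrate the "length of the displayed text" point, here is a rough sketch that counts characters in DOM text nodes instead of bytes of markup. The function name `truncate_by_text()`, the `excerpt-root` wrapper id and the `...` suffix are all made up for this example; none of it comes from TextFormatter or php-shorten. It assumes UTF-8 input and that ext/dom is available. It is only word-safe within a single text node, and if the first word of a node doesn't fit, nothing from that node is kept.

```php
<?php
// Hypothetical helper: truncate by visible text length, not markup length.
function truncate_by_text(string $html, int $limit, string $suffix = '...'): string
{
    $dom = new DOMDocument;
    // The <meta> hint asks libxml to read the input as UTF-8. The wrapper <div>
    // (an id invented for this sketch) gives a known root to serialize from.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div id="excerpt-root">' . $html . '</div>',
        LIBXML_NOERROR | LIBXML_NOWARNING
    );
    $xpath = new DOMXPath($dom);
    $root  = $xpath->query('//div[@id="excerpt-root"]')->item(0);

    $remaining = max(0, $limit);
    // XPath returns the text nodes in document order
    foreach ($xpath->query('.//text()', $root) as $text) {
        // Count characters (not bytes) in the decoded text
        $length = preg_match_all('/./su', $text->data);
        if ($length <= $remaining) {
            $remaining -= $length;
            continue;
        }

        // Longest prefix within budget that stops right before whitespace;
        // if even the first word doesn't fit, keep nothing from this node.
        $kept = preg_match('/^(.{0,' . $remaining . '})(?=\s)/su', $text->data, $m)
            ? rtrim($m[1])
            : '';
        $text->data = $kept . $suffix;

        // Remove everything that follows this node in document order
        for ($node = $text; !$node->isSameNode($root); $node = $node->parentNode) {
            while ($node->nextSibling) {
                $node->parentNode->removeChild($node->nextSibling);
            }
        }
        break;
    }

    // Serialize the wrapper's children; open elements are closed by the DOM
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo truncate_by_text('Hello <b>big wide</b> world, <a href="#">link</a>', 12);
```

With a budget of 12 text characters, "Hello " uses 6 and "big wide" doesn't fit in the remaining 6, so the cut falls after "big" and the `<b>` element is closed by the serializer. Since this parses the whole input, it does nothing for the CPU-time concern mentioned above.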
Yeah, I saw a few other libraries that appear to share some DNA with mrcgrtz/php-shorten. The CakePHP util is a different codebase but it seems to work more or less the same way. Unless you know exactly the kind of HTML you're working with, I don't think it's realistic to guarantee the HTML's well-formedness in PHP; it's just too complicated to really track which elements need closing because of optional or omissible end tags.

If you use PHP's ext/dom, you get a more robust implementation of the HTML specs. The HTML 4.0 specs, that is. If you want a library that supports a version of HTML that's not of voting age, there are userland implementations, but at this point you're starting to run something like 20-40 KLOC to truncate a 400-byte string and you're wondering if you're not going a tad overboard. And by you, I mean me.

Anyway, for some reason I can't stop code-golfing about doing most of the heavy lifting with PCRE and I came up with the following, which I'm leaving here in case somebody ends up finding this thread from a search engine:

    function totally_legit_truncate(string $html, int $len, bool $exact = false)
    {
        // When $exact is true, the match has to end on a word boundary
        $boundary = ($exact) ? '\\b' : '';
        // Each repetition skips any number of tags, then consumes one entity or one character
        $regexp = '(^(?:(?:<[^>]++>)*+(?:&#?[a-z0-9]++;|.)){0,' . $len . '}' . $boundary . ')is';
        // Switch to Unicode mode if the input contains non-ASCII bytes
        if (preg_match('([\\x80-\\xFF])', $html))
        {
            $regexp .= 'u';
        }
        // Bail out if nothing matched; loadHTML() does not accept an empty string
        if (!preg_match($regexp, $html, $m) || $m[0] === '')
        {
            return '';
        }
        // Let DOM close whatever elements were left open by the cut
        $dom = new DOMDocument;
        $dom->loadHTML($m[0], LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED);
        return $dom->saveHTML();
    }

    $html = 'Hello world <a href="#" rel="nofollow">link</a>';
    $html = totally_legit_truncate($html, 14);
    var_dump($html);

You can quickly go through HTML with PCRE (you can even skip entire spans like
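To make it easier to see what the pattern counts, here is a small sketch that runs the same regexp on its own and returns the raw prefix, without the DOM repair step. `match_prefix()` is a name made up for this example, and the `\b` option and the `u` flag detection are left out to keep it short.

```php
<?php
// Hypothetical helper: returns the raw, possibly unbalanced, prefix that the
// pattern from the post above matches, with no DOM repair afterwards.
function match_prefix(string $html, int $len): string
{
    // Each repetition skips any number of tags, then consumes one entity or one character
    $regexp = '(^(?:(?:<[^>]++>)*+(?:&#?[a-z0-9]++;|.)){0,' . $len . '})is';

    return preg_match($regexp, $html, $m) ? $m[0] : '';
}

// The opening <a> tag is skipped for free, so 14 characters reach "li"
echo match_prefix('Hello world <a href="#" rel="nofollow">link</a>', 14), "\n";
// prints: Hello world <a href="#" rel="nofollow">li

// An entity counts as a single character
echo match_prefix('a &amp; b', 3), "\n";
// prints: a &amp;
```

The first result ends inside the `<a>` element, which is the part that gets handed to DOM for closing.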
Hi @JoshyPHP,
I'm trying to show an "excerpt" of some TextFormatter-formatted content, and I'm wondering if this is something that could be done in the library itself.
My initial approach has been to perform truncation on the rendered HTML. The best library I have found to do this is mrcgrtz/php-shorten. This library has some bugs, and in general, trying to safely truncate HTML without loading up a full DOM parser seems a bit fragile.
It occurred to me that when TextFormatter renders parsed content, it's already going through it with an XML parser (as I understand it). What if I could tell the library: "stop rendering after X text characters (preferably word-safe), append '...', and close all open tags"? Is there a way to do this? What do you think would be the best approach?