Generating Abstract Text from HTML content

  

In the world of online content management, we often want to introduce an article with a summary or abstract. The best example of this is the front page of a blog. You are presented with a list of the most recent articles with just a few paragraphs from each to get your interest. If you want to continue, you can click through and get the full text. Until recently, this blog didn't do that. It's a pretty important addition for me because my posts tend to run long, and my home page was getting pretty lengthy from just 3 posts. Here's the rundown on adding this feature.

A Few Requirements

First, a little setup. Ideally this summary would be written specifically for the article and formatted nicely for its space on the page. But since we live in the real world, we are generally given a few requirements for our system to make this easier. These are the general guidelines I used:

  • The summary should be generated automatically from the first paragraph of the full text (or first 2 or first 3 ...)
  • The summary should preserve any html that may be present (e.g. links, bullet points, etc).
  • The editor (me) should have the option of overriding this auto-generated summary with a custom one.

The Tools

I wanted to see how simple I could make this with the system I'm using. I have a few things going for me.

Almost all of the html content is available as well-formed xml. This means I can manipulate xml node objects instead of parsing text. Using text manipulation techniques to modify tag-based html has always bothered me. You end up either writing lots of code to handle lots of different possibilities or relying on complex regular expressions. By representing your content as xml (or in this case xhtml) you can be a little more confident that your output will still be well-formed and valid. And you can probably manage your changes with less code.

It's not DOM, it's e4x. This system is Java based. And yes, that's a good thing. Because that allows me to use the Rhino JavaScript Engine. In my opinion, javascript is a very nice scripting language. And working with it on the server as well as the client side is pretty nice. But more importantly, Rhino has native xml support through the e4x specification. E4x is much easier to work with than the w3c DOM spec or most language APIs for manipulating xml. This makes some of the code look a little funky.

It's fully object oriented. This isn't really an objective advantage. I just prefer it. All it means in this case is that all of the functions I'm using are actually methods on an object. The object represents my blog article and I have access to any of the properties of the article through the "this" keyword. I'll talk about this more in the second part to Why you should be using an Object Database instead of ORM.

The Code

The first step is a function that returns the summary. I'm calling the summary the "abstract", and the first pass just returns whatever is stored on the article.


function get_abstract(){
    return this.post_abstract;
}

This is just a getter function. But what if no abstract has been provided? Then we're ready to generate it. So let's update this a bit.


function get_abstract(){
    // nothing saved yet, so generate the abstract and store it on the article
    if(!this.post_abstract){
        this.post_abstract = this.generate_abstract();
    }

    return this.post_abstract;
}

A few things to notice here. First off, the "post_abstract" is a property on our blog article. That means when we give it a value, it's automatically saved into our database. There's nothing else we need to do to be able to access the value of this property again later. Second, the "generate_abstract" function is also a method of our blog article, so it needs to be accessed through "this".
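The caching behavior is easy to check outside of Rhino. Here is a minimal sketch in plain JavaScript. It assumes a hypothetical Article constructor where post_abstract is an ordinary in-memory field rather than a persisted property, where the body is a plain array of html strings, and where generateCalls is a counter added only for this demonstration. None of that is the real system, it just mimics the shape of it.

```javascript
// Hypothetical stand-in for the blog article object. On the real system,
// assigning to post_abstract would also save it; here it is just a field.
function Article(body, postAbstract) {
  this.body = body;                         // array of top level html strings
  this.post_abstract = postAbstract || null;
  this.generateCalls = 0;                   // demo-only instrumentation
}

// Simplified generator: join the first three top level elements.
Article.prototype.generate_abstract = function () {
  this.generateCalls += 1;
  return this.body.slice(0, 3).join("");
};

// Same logic as the getter in the post: generate once, then reuse.
Article.prototype.get_abstract = function () {
  if (!this.post_abstract) {
    this.post_abstract = this.generate_abstract();
  }
  return this.post_abstract;
};

var auto = new Article(["<p>one</p>", "<p>two</p>", "<p>three</p>", "<p>four</p>"]);
auto.get_abstract();
auto.get_abstract();
console.log(auto.generateCalls);   // 1: generated once, then served from the saved value

var custom = new Article(["<p>one</p>"], "<p>Hand-written summary</p>");
console.log(custom.get_abstract()); // the manual abstract wins
console.log(custom.generateCalls);  // 0: the generator was never needed
```

The second article shows the override requirement from the list above: if the editor has already filled in post_abstract, the generator is never called.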

So we've got a simple function that returns our abstract. And if there isn't one, it calls a function to get one. Let's check that out.


function generate_abstract(){
    var limit = 3;
    var tmpbody = this.body.elements();

    var abs = <></>;
    for(var idx in tmpbody){
        abs += tmpbody[idx];
        if(abs.length() >= limit) break;
    }

    return abs;
}

This is a pretty simple function. It grabs all of the elements out of the body of the article. These are xml elements representing our xhtml. Probably paragraph tags. But we don't have to actually worry about it. It could be any type of element. We loop through these elements and grab as many as we need. Then we add them to our temporary abstract. This "abs" variable is the only place in this function where you see the e4x peeking through. We start by creating an empty XMLList. This is essentially an xml fragment with no top level container element. That describes the html body of our blog post perfectly. Here's a snippet of the html of this post.



<p>In the world of online content management, we often want
 to introduce an article with a summary or abstract.
The best example of this is the front page of a blog.
You are presented with a list of the most recent articles
with just a few paragraphs from each to get your interest.
If you want to continue, you can click through and get the full
 text. Until recently, this blog didn't do that. Since I
I haven't seen a lot of</p>
<h2>A Few Requirements</h2>
<p>First a little setup. Ideally this summary would be
written specifically for the article and formatted nicely
for its space on the page. But since we live in the real
world, we are generally given a few requirements for our
system to make this easier. These are the general guidelines
 I used:</p>
<ul>
<li>The summary should be generated automatically from
the first paragraph of the full text (or first 2 or first
3 ...)</li>
<li>The summary should preserve any html that may be
present (i.e. links, bullet points, etc).</li>
<li>The editor (me) should have the option of overriding
 this auto-generated summary with a custom one.</li>
</ul>
...

So our list of elements would look something like this: [<p>, <h2>, <p>, <ul>, ...]. We grab the first 3 to form our abstract, and return the new XMLList. That's our summary! The first 3 html elements from our body text. It gets saved on our blog article object so that any time we need it, it's available. No sense in doing this work every time. So that's it. I hope this has given you a little taste of how things work on this blog. This doesn't even scratch the surface of the power and convenience in using e4x to manipulate your content.
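To make the selection step concrete without e4x, here is a rough plain JavaScript sketch. The postBody array and the takeFirstElements helper are made up for illustration. Each entry stands for one top level element of the snippet above, reduced to just its tag name.

```javascript
// Hypothetical model of the post body: one entry per top level element,
// in document order. Only the tag name is kept.
var postBody = [
  { tag: "p" },
  { tag: "h2" },
  { tag: "p" },
  { tag: "ul" }
];

// Same loop shape as generate_abstract: collect elements until the limit is hit.
function takeFirstElements(elements, limit) {
  var picked = [];
  for (var i = 0; i < elements.length; i++) {
    picked.push(elements[i]);
    if (picked.length >= limit) break;
  }
  return picked;
}

var summary = takeFirstElements(postBody, 3);
console.log(summary.map(function (el) { return el.tag; }).join(", ")); // p, h2, p
```

Notice that the heading counts as one of the three elements, so a limit of 3 gives this post two paragraphs and a heading rather than three paragraphs. If the body has fewer elements than the limit, the loop simply returns everything it found.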

Before we wrap, I have something to confess. The generate_abstract code above doesn't work. This was a weird gotcha moment in the e4x spec that I had to work around. When working with an XMLList, calling the elements() function doesn't return the first level children of the list. It returns the children of all of those elements. So instead of my <p> tags I got the elements nested inside them. Once I figured it out, there was a simple addition to account for it. Just wrap the body in a top level node first. Then the call to elements() will give you the child nodes you expect. As an afterthought, I've included the ability to specify the length of the abstract. Here's the final version of the function.


function generate_abstract(elimit){
    // number of top level elements to keep; defaults to 3
    elimit = elimit || 3;

    // wrap the body so elements() returns its top level children
    var tmpbody = <div></div>;
    tmpbody.appendChild(this.body);
    tmpbody = tmpbody.elements();

    var abs = <></>;
    for(var idx in tmpbody){
        abs += tmpbody[idx];
        if(abs.length() >= elimit) break;
    }

    return abs;
}
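If the XMLList gotcha is hard to picture, this plain JavaScript model may help. It is not e4x, just a sketch under my reading of the behavior described above. The el helper, the node shape and both elementsOf functions are hypothetical names invented for the example.

```javascript
// Hypothetical node shape: { tag: string, children: array of nodes or strings }.
function el(tag, children) {
  return { tag: tag, children: children || [] };
}

// Models calling elements() on a single node: its element children only.
function elementsOfNode(node) {
  return node.children.filter(function (child) {
    return typeof child === "object";
  });
}

// Models calling elements() on a list: the element children of every
// item, concatenated. The items themselves are not returned.
function elementsOfList(list) {
  return list.reduce(function (acc, node) {
    return acc.concat(elementsOfNode(node));
  }, []);
}

function tags(nodes) {
  return nodes.map(function (n) { return n.tag; }).join(", ");
}

// A fragment with no top level container, like the post body.
var fragment = [
  el("p", ["Intro with a ", el("a", ["link"])]),
  el("h2", ["A Few Requirements"]),
  el("ul", [el("li", ["one"]), el("li", ["two"])])
];

console.log(tags(elementsOfList(fragment)));            // a, li, li
console.log(tags(elementsOfNode(el("div", fragment)))); // p, h2, ul
```

Called on the bare fragment, the list version skips right past the paragraphs and hands back what is inside them. Wrapping the fragment in a single container first turns the paragraphs into children of one node, which is what the div in the final function is doing.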

Oh yeah, replacing the full text on the home page with the abstract text was pretty simple. Instead of reading this.body, call this.get_abstract(). Pretty straight up. Enjoy the shorter home page.

7 Responses to this Article

  • Added by Dan on

    Why not just merge get_abstract and generate_abstract, and short circuit the function if an abstract exists?

    Alternatively, a different take on get_abstract:

    function get_abstract(){
    return this.post_abstract?this.post_abstract:this.generate_abstract();
    }

  • Added by Asskicker on

    Is the .elements() issue with Rhino or just with the spec itself? Seems that having to create another body object is a bit memory intensive.

  • Added by Marco on

    @Dan - That's actually the exact code I had for the first draft of this function. But it never saves the output of generate_abstract. So post_abstract would never actually be populated. It would always fall through to calling the function.

    As to merging get/generate, I wanted to have generate_abstract as a separate function that could be called at any time without changing the actual saved version on the site. Because post_abstract could also be a manually input version of the abstract.

    Thanks for the comment.

  • Added by Marco on

    @Masroor

    That is correct behavior per the spec. There are a few oddities like that because they intentionally blur the line between an XML object with a top level element and an XMLList object which is a collection of nodes. They have the same functions, but the functions may behave differently.

    I agree with what you're gonna say next, which is that's pretty stupid.

    In terms of it being memory intensive, you're probably right. It'd be interesting to see what kind of memory footprint these e4x objects have. Especially when being copied like I'm doing.

    But I'm a big fan of avoiding premature optimization. And I didn't see a sensible way around this hole in the spec. So until this blog starts getting a ton of traffic, I doubt I'll worry about it too much.

    Thanks

  • Added by Dan on

    @Marco: ZOMG You are right!
    But I still think they ought to be merged.

  • Added by Dan on

    @Marco: I just changed my mind after reading the rest of your comment.

  • Added by ElenaLisvato on

    I don’t usually reply to posts but I will in this case. I’ve been experiencing this very same problem with a new WordPress installation of mine. I’ve spent weeks calibrating and getting it ready when all of a sudden… I cannot delete any content. It’s a workaround that, although isn’t perfect, does the trick so thanks! I really hope this problem gets solved properly asap.
