Monday, July 12, 2010

Cheese Soup and the Semantic Web

If you haven't yet heard of the Semantic Web, don't worry. You will eventually. It's going to take off any day now. Really. Web 3.0. OK, so the world has been moving very slowly in the direction of the Semantic Web. Sorta like IPv6. We'll get there eventually.

So I've decided that I want to help move this thing along. I see big potential here, and I think there are some practical applications already. Consider searching for a recipe on Google. Let's say you want to make some cheese soup. It's delicious and I highly recommend it. Your search results will probably look something like this...

That first site sounds like a great place for cheese soup recipes, but wait... Alton Brown's recipe has a bit of extra information displayed. A rating, number of reviews, prep time, and cook time. How is this possible? Google's pretty smart, but how did they manage to parse that information out of the web page? Searching for other types of recipes reveals that this information is listed on most recipes from the big recipe sites (Food Network, All Recipes, My Recipes). Here's the trick - they've already embraced the semantic web by adding RDFa to their web pages. Google has the ability to parse this information out and uses it for better search results.

It just so happens that the first site is mine, and today we're going to work through the steps of adding RDFa to my cheese soup recipes. Not only will this enable me to better serve the cheese soup-loving public, but it will teach a lot of practical concepts in the emerging world of machine-readable web pages.

To begin, let's first understand what RDFa is. It stands for Resource Description Framework in Attributes. Basically, it's an extension to XHTML that adds a set of attributes for embedding metadata in web documents. All we'll have to do is add some simple attributes to the existing XHTML tags and we're done.

To make this work, you have to include the namespace declaration for the RDF vocabulary you want to use. You can see how I do that here in the markup from one of the recipes. I've simplified it a little for this example, but it looks pretty much like this:

<html xmlns=""

<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
<title>Cheese Soup</title>
Make sure you put the # at the end of the URL, it's required. You can use this namespace whenever you want to mark up recipes, and also people, places, events, and reviews. There are lots of other RDF vocabularies for things other than recipes (Dublin Core, Friend of a Friend) you can use, and you can even make your own. We might cover that in a later discussion.
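
Why does the trailing # matter? As far as I understand RDFa's prefix handling, a compact term like v:Recipe is expanded by gluing the namespace URI and the local name together, with nothing added in between. Here is a minimal Python sketch of that idea. The expand_curie helper and the example.org URIs are made up for illustration; they are not the real vocabulary namespace.

```python
def expand_curie(curie, prefixes):
    """Expand a compact 'prefix:name' term into a full URI by plain concatenation."""
    prefix, _, local_name = curie.partition(":")
    return prefixes[prefix] + local_name


# Hypothetical namespace URIs, used only for illustration.
with_hash = {"v": "http://example.org/vocab#"}
without_hash = {"v": "http://example.org/vocab"}

print(expand_curie("v:Recipe", with_hash))     # http://example.org/vocab#Recipe
print(expand_curie("v:Recipe", without_hash))  # http://example.org/vocabRecipe
```

Without the #, the vocabulary URI and the term name run together into a different URI, and a parser would no longer recognize the term.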

Another thing to note in the code snippet above is the doctype declaration. I'm using the XHTML+RDFa document type because this document will be XHTML with RDFa and I want it to be completely valid.

Now when it comes to the actual recipe, we just have to add the right attributes in the right places. There is a containing div around the whole recipe, so we'll add a typeof attribute to that indicating that the div represents a recipe. Everything contained inside the div will be associated with the recipe.
<div typeof="v:Recipe">
Inside the recipe div, we'll add property attributes in the appropriate places to mark up the various properties of recipes.
<h2 property="v:name">Canadian Cheddar Cheese Soup</h2>
Basically, all we've done here is set the "name" property of a "Recipe" to be the contents of the h2 tag. These tag contents are now doing double-duty. They're both human-readable and machine-readable. Things get a little more interesting when we get to the ingredients list.
<li rel="v:ingredient">
  <span typeof="v:Ingredient">
    <span property="v:amount">1 lb</span>
    <span property="v:name">shredded cheddar cheese</span>
  </span>
</li>
<li rel="v:ingredient">
  <span typeof="v:Ingredient">
    <span property="v:amount">1/2 lb</span>
    <span property="v:name">bacon</span>
  </span>
</li>
You'll notice that the li tags contain a rel attribute. This indicates a relationship with another resource, which in this case is an ingredient. Ingredients are types of objects just like Recipes, so inside the li tag we put a span with the typeof attribute indicating that its contents are an Ingredient. Ingredients have names and amounts, so spans inside have property attributes to annotate them. All of this extra information is included for the machines to read, and the humans don't even notice!
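
To get a feel for what "machine-readable" means here, this is a rough Python sketch of how a program could pull the ingredients back out of that markup. It uses only the standard library's html.parser and is nowhere near a real RDFa processor (it ignores rel, nesting rules, and namespaces); the IngredientParser class is my own invention for this example, and I have no idea how Google actually does it.

```python
from html.parser import HTMLParser

MARKUP = """
<ul>
  <li rel="v:ingredient">
    <span typeof="v:Ingredient">
      <span property="v:amount">1 lb</span>
      <span property="v:name">shredded cheddar cheese</span>
    </span>
  </li>
  <li rel="v:ingredient">
    <span typeof="v:Ingredient">
      <span property="v:amount">1/2 lb</span>
      <span property="v:name">bacon</span>
    </span>
  </li>
</ul>
"""


class IngredientParser(HTMLParser):
    """Collect the property values found inside each typeof="v:Ingredient" element."""

    def __init__(self):
        super().__init__()
        self.ingredients = []   # one dict per ingredient, in document order
        self._property = None   # property name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("typeof") == "v:Ingredient":
            # A new Ingredient object starts here.
            self.ingredients.append({})
        elif "property" in attrs and self.ingredients:
            # Remember which property the upcoming text belongs to.
            self._property = attrs["property"]

    def handle_data(self, data):
        if self._property is not None:
            self.ingredients[-1][self._property] = data.strip()
            self._property = None

    def handle_endtag(self, tag):
        # Stop waiting for text once any element closes.
        self._property = None


parser = IngredientParser()
parser.feed(MARKUP)
for ingredient in parser.ingredients:
    print(ingredient["v:amount"], "of", ingredient["v:name"])
```

The same text the visitor reads in the list is what ends up in the dictionaries, which is the whole point of the double-duty markup.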

There's just one more special thing to consider. What happens when you want the made-for-human content to look different than the made-for-machines content? This happens especially when working with times. Durations such as the preparation time are expected in a particular format, which happens to be the ISO 8601 duration format (PT15M means 15 minutes). Fortunately, there is a simple solution. An optional content attribute overrides the contents of the element, so the human sees one thing and the machine sees something else. Here's an example of this for the preparation time.
<p>Preparation time: <span property="v:prepTime" content="PT15M">
15 minutes</span></p>
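
Here is a small Python sketch of both halves of that idea: picking the content attribute over the visible text, and turning an ISO 8601 duration into minutes. The function names are hypothetical, and the regular expression only handles the time part of a duration (hours, minutes, seconds), which is all a recipe is likely to need.

```python
import re

# Handles only the time part of an ISO 8601 duration, e.g. PT15M or PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_minutes(value):
    """Convert a duration like 'PT1H30M' into a number of minutes."""
    match = _DURATION.match(value)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes + seconds / 60


def machine_value(attrs, text):
    """Prefer the content attribute; fall back to the element's visible text."""
    return attrs.get("content", text.strip())


# The span from the recipe: humans read "15 minutes", machines read "PT15M".
value = machine_value({"property": "v:prepTime", "content": "PT15M"}, "\n15 minutes")
print(value)                       # PT15M
print(duration_to_minutes(value))  # 15.0
```

An element without a content attribute, like the recipe name, simply falls back to its text.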
The rest of the recipe (instructions, rating, etc.) can be annotated in a similar fashion. You can see the complete document here. As you can see, the page looks perfectly normal despite having all this great "hidden" information. Be sure to view the source of the page to see all the RDFa markup, and notice the link at the bottom. The page successfully validates as XHTML + RDFa using the W3C validation tool.

If you're interested in everything that the Google bots can use from your RDFa markup, check out this page on Google Webmaster Tools. Hopefully, my cheese soup recipes will get re-indexed soon and the extra info will show up in the search results. But let's not do this just for Google searches. There's amazing potential here if we annotate all of our web pages with as much RDF as possible. Check out the Dublin Core and Friend of a Friend vocabularies. The Semantic Web will only appear if we build it!