Elevated Words

Analyse which words on a website are elevated by the way they are shown.
Note: Currently an English stopword list is applied regardless of the site's language. This will be updated soon.



Further Information

What's the idea behind it?

Besides the information that is more or less directly meant for search engines (title, meta keywords and description), search engines try to understand what a given page is about so that they can optimise indexing. One aspect of this is understanding what was important to the page author.
When adding content (in the form of text) to a page or blog, I normally start by writing plain text. This often leads to a solid block of text that is hard for a reader to digest.
It helps to understand how many people, including myself, surf the internet. After a new page has opened, I quickly scan the content to find out whether it matches what I'm currently interested in. I'm actually looking for occurrences of my personal buzzwords and maybe supporting images or graphs. And that's exactly what a page author is aware of.
Therefore I try to group parts of the text that belong together and place a summarising headline above each group (h1-h6). Quotes can be used to let another person explain a topic in their own words (blockquote). Where useful, an explicit summary can be added (summary).
But even within the remaining text I want to emphasise words or phrases because there is a specific aspect I want to highlight, e.g. by writing them in bold (b, strong) or italics (em, i), or by underlining (u) or marking (mark) them.
Text that is meant as a definition can be presented to the reader using the tags intended for this (dt, dd, dfn). Tables are a great means of structuring data with crisp column descriptions (th).
All these tags make text stand out to the user, to attract interest and quickly show what the page is about. Search engines take advantage of authors highlighting the scope of a page with these tags. Conversely, it is a good idea to make sure that a given page is structured and scoped using these means.
Unfortunately the well-known proverb "The more the merrier!" does not apply here. Just underlining all the text on a page not only keeps readers away but also forfeits any extra weighting from search engines.

How does this tool work?

Input to this analysis is either an already published web page, provided by its URL, or draft HTML code pasted into the text area. This makes it easy to check a page that is still being finished, check a page that has already been published, or understand the keyword balance of other sites.
The page's code is analysed for each of these tags separately
[<title>, <summary>, <blockquote>, <h1> - <h6>, <dt>, <dd>, <a>, <th>, <strong>, <b>, <u>, <em>, <i>, <mark>, <dfn>]:

  1. Text inside all occurrences of the analysed tag is aggregated.
  2. Text is cleaned of encodings and punctuation and converted to lower case.
  3. Stopwords are removed from the word list, as they don't change the scope.
    (currently only English, more to come)
  4. A word histogram is computed from this word list and stored for this tag.
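
Steps 2 to 4 could look roughly like the following Python sketch. It is not the tool's actual code: the function name word_histogram, the tiny STOPWORDS set and the reading of "encodings" as HTML entities are all assumptions made for illustration.

```python
import html
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); the tool's real English list is not shown here.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "on"}


def word_histogram(text, stopwords=STOPWORDS):
    """Clean the text, drop stopwords and count the remaining words."""
    decoded = html.unescape(text)            # step 2: resolve entities such as &amp;
    lowered = decoded.lower()                # step 2: lower case
    words = re.findall(r"[^\W_]+", lowered)  # step 2: keep letter/digit runs, drop punctuation
    kept = [w for w in words if w not in stopwords]  # step 3: remove stopwords
    return Counter(kept)                     # step 4: word histogram


print(word_histogram("The Quick &amp; the quick, brown fox!"))
```

Note that this simple pattern also splits words at apostrophes and hyphens, which a real implementation may want to handle differently.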
Because each tag is analysed separately, words inside nested tags appear in the word histogram of every enclosing tag. Summed over the individual tag histograms, they are therefore counted more often than they actually occur in the text.
To give an overall view, a separate run is done on the combined text of all these tags. This run makes sure that each written word is counted only once, even if it sits inside nested tags.
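
The difference between the per-tag histograms and the combined run can be illustrated with a small Python sketch built on the standard library's html.parser. The class TagTextCollector and the function histograms are hypothetical names, and the whitespace-only tokenisation is a simplification; the tool itself may work differently.

```python
from collections import Counter
from html.parser import HTMLParser

ANALYSED_TAGS = {"title", "summary", "blockquote", "h1", "h2", "h3", "h4", "h5", "h6",
                 "dt", "dd", "a", "th", "strong", "b", "u", "em", "i", "mark", "dfn"}


class TagTextCollector(HTMLParser):
    """Collects text per analysed tag and, separately, once for all tags combined."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open analysed tags
        self.per_tag = {}     # tag name -> list of text chunks
        self.combined = []    # chunks inside at least one analysed tag, stored once

    def handle_starttag(self, tag, attrs):
        if tag in ANALYSED_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # close the innermost open tag with this name
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        if not self.open_tags:
            return                          # text outside analysed tags is ignored
        for tag in set(self.open_tags):     # every enclosing tag sees the text
            self.per_tag.setdefault(tag, []).append(data)
        self.combined.append(data)          # the combined run sees it only once


def histograms(html_text):
    collector = TagTextCollector()
    collector.feed(html_text)
    collector.close()
    # Simplified tokenisation: lower case and split on whitespace only.
    per_tag = {tag: Counter(" ".join(chunks).lower().split())
               for tag, chunks in collector.per_tag.items()}
    combined = Counter(" ".join(collector.combined).lower().split())
    return per_tag, combined


per_tag, combined = histograms("<h1>Fresh <em>coffee</em> beans</h1><p>plain text</p>")
print(per_tag["h1"]["coffee"] + per_tag["em"]["coffee"])  # counted in both nested tags
print(combined["coffee"])                                 # counted once in the combined run
```

In this example "coffee" sits inside both h1 and em, so the per-tag histograms add up to two occurrences while the combined run reports the single occurrence that is actually written on the page.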