Readability analysis is used to tell how readable a text is. Although the idea is very old, it has only recently started to be used as a criterion for page analysis. In this post, I will try to explain why the goal of readability is misunderstood. In particular, for web content analysis, what people want to analyze is understandability, not readability.
Readability analysis
First, let's briefly look at what a readability analysis is. It is a set of algorithms designed to compute an index that tells how hard a text is to read.
It was primarily invented to ensure that school books can be read by kids.
The idea behind any readability test is very simple: long words and long sentences are harder to read than short ones. There is no magic in these criteria; they are purely syntactic.
To make it clearer, let's illustrate how one of the most popular readability indexes works: the Gunning-fog index, which was invented in 1952. Note that it was designed to work on English text.
Gunning-fog principle
To evaluate how hard a text is to read, the Gunning-fog index works like this:
- Find the average sentence length (divide the number of words by the number of sentences).
- Count the words with three or more syllables (complex words). Do not include proper nouns (for example, Intel or iPhone), familiar jargon, or compound words, and do not count common suffixes such as -es, -ed, or -ing as a syllable.
- Add the average sentence length and the percentage of complex words (e.g., +13.37%, not simply +0.1337).
- Multiply the result by 0.4.
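The Gunning-fog formula (taken from Wikipedia) is:
$$\text{Gunning-fog index} = 0.4 \left[ \frac{\text{words}}{\text{sentences}} + 100 \cdot \frac{\text{complex words}}{\text{words}} \right]$$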
This formula is pretty straightforward except for the 0.4 factor. It can be viewed as a “weight” that maps the index value onto school grades. It is of course totally arbitrary.
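As a rough sketch, the steps listed above could be written in Python like this. The function names are mine, the syllable counter is only a crude vowel-group heuristic, and proper nouns, compound words and jargon are not excluded, so the result will not match what a real tool reports.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    # Drop the common suffixes that the rules say not to count as a syllable.
    word = re.sub(r"(es|ed|ing)$", "", word)
    groups = re.findall(r"[aeiouy]+", word)
    return max(1, len(groups))


def gunning_fog(text):
    """Approximate Gunning-fog index of an English text."""
    # Naive splitting: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    average_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (average_sentence_length + percent_complex)


if __name__ == "__main__":
    print(gunning_fog("The cat sat. The dog ran."))
```

Note that the code only ever looks at lengths and counts. Nothing in it knows what any of the words mean.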
Because they are very simple to implement, readability indexes have become quite popular nowadays. Many blog posts detail them (here and here for example), and “smart” Web analysis tools such as websitegrader use them.
Here is a screenshot of the analysis of this blog done by websitegrader (I really love this service):

As you can see, this blog has a readability index of 4th grade. Indeed, a 4th grade student can read it; however, I doubt that a 4th grade student would be able to understand it. Short words and sentences may be easier to read, but that does not make them the most understandable. Let's look at a small example to stress the point.
Gunning-fog example
Consider the reading index of the two following texts. I used textalyser.net to compute the Gunning-fog index (so you can do it at home if you don't believe me). The first is taken from a cookie recipe (I love cookies) by Peggy Trowbridge Filippone (thanks for your columns):
“If you like pecan sandies, you’ll love this chocolate variation of a favorite cookie. These cookies are not overly sweet, yet have a rich chocolate flavor. Easy to make and great for shipping as gifts from the kitchen, since they hold up well.”
Text analyzer reports the following results for this text:

The second is taken from a graphics card review done by the very popular site Tom's Hardware (love your reviews too):
“The GeForce 8800 GT 256 MB is a 256 MB version of the GeForce 8800 GT. The decreased memory capacity is also slower. With the official specifications at 700 MHz, the bandwidth is 22% lower. It’s a difference that’s bigger than that of the first generation of GeForce 8800 GTS (640 and 32 MB), but smaller than between the GeForce 7800 GTX and its 512 MB version”

As you can see, these two texts are nearly equivalent from a syntactic point of view (number of words, average sentence length, etc.). However, if you look at the Gunning-fog index, you see that the cookie recipe is rated harder to read than the graphics card review (8.8 versus 8.9). It might be harder to read, but it is obvious that the technical article is harder to understand, which is exactly my point.
You might also have noticed that the readability index at the bottom gives the opposite ranking. This is because it uses a different formula (I don't know which one). But with suitably chosen technical and recipe articles, it can still rank the recipe as harder than the technical one.
At this point I should have convinced you that a readability index is not meant to tell you how hard your text is to understand. Yet if you are involved in writing (in the broad sense: blogging, marketing, advertising…), how well your text is understood is exactly what you want to measure. You want something that tells you whether your text can be understood by your target audience. So what you really want is not a readability analysis but an understandability analysis. Although no such metric currently exists, it is a hot research topic. As a matter of fact, a research article was released on the subject a few weeks ago.
Understandability analysis
So how can we measure the understandability of a text? Understanding a text means understanding the concepts behind the text. For example, if I read a text about blog optimization for search engines, I need to understand the concepts of blog and search engine to be able to understand the text. This is not syntactic, it is semantic. Hence an understandability criterion needs to be semantic. That is why it is so difficult to build. Counting words or measuring word length is easy; determining the meaning of a word or a sentence is not.
To conclude this post, let's take a look at some of the most promising approaches to understandability analysis.
Lexical density
First of all, trying to determine the understandability of a text by looking at every word is not relevant, because many words such as “love”, “neat” or “great” can be viewed as “blank” for understandability. For example, consider the two sentences:
“a great toy”
which is easy to understand and
“EXPTIME-complete is a great complexity”
which is not understandable unless you know about algorithmic complexity. In this context, it is useful to take into account the lexical density of the text to isolate the words that recur. A rule of thumb (I can't remember who said it first) is that “the more a word appears in a text, the more relevant it is to the text's topic”. For example, look at the lexical density of this Steve Jobs speech (taken from Russell's blog). The bigger the word, the more often it was said:

Even without knowing which speech it is, you can easily deduce that it was about the iPhone.
From the understandability point of view, Jobs's speech requires you to know at least what a phone and an iPod are. If you don't understand what a processor is, you will probably be lost for a few slides, but you will still get the key message.
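To give an idea of how such a word ranking can be obtained, here is a minimal sketch in Python. The function name and the tiny list of “blank” words are my own assumptions for illustration; a real tool would use a much longer list.

```python
import re
from collections import Counter

# Tiny illustrative list of "blank" words (an assumption, not a standard list).
BLANK_WORDS = {
    "a", "an", "and", "the", "is", "it", "of", "to", "in", "that",
    "this", "we", "you", "are", "for", "on", "with",
}


def top_words(text, n=5):
    """Return the n most frequent words of a text, ignoring blank words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in BLANK_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "The phone is an iPod and a phone. This phone is great."
    print(top_words(sample, 3))
```

The words that come out on top are the ones a reader has to know to follow the text, so they are the ones worth checking first.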
Machine learning and word mapping
Now that we know which words we want to analyze, one question remains: how do we estimate the level of understanding required for a given word? One way to do it is to use a supervised machine learning algorithm, that is, one trained on labeled examples. This is not as hard as it sounds. You have already heard of machine learning algorithms: they are used for spam filtering, where they tell you whether a text is ham (legitimate text) or spam. The most widely used algorithm for this is the (naive) Bayesian filter.
It works by assigning to each word (token) a probability of being spam, and then computing the probability for a given message by combining the probabilities of its tokens. This is not completely accurate, but it is the general principle. We say that it is a text classifier: it classifies a text as spam or ham.
This classification can be extended to more than the two categories spam and ham. We can imagine a Bayesian filter that classifies texts into categories such as easy to understand, average, hard and specialized.
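To make the idea more concrete, here is a minimal naive Bayes text classifier sketched in Python. The class name, the labels and the four training sentences are made up for illustration; a real classifier would need a large corpus for each level. The sketch sums the logarithms of the token probabilities (which is the same as multiplying the probabilities) and adds one to every count so that a word missing from a category does not bring its probability to zero.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Start from the log of the share of training texts with this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add-one smoothing: every word gets a count of at least 1.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    clf = NaiveBayesClassifier()
    clf.train("I love cookies and chocolate cookies", "easy")
    clf.train("a great toy for kids", "easy")
    clf.train("the bandwidth of the memory bus limits the processor", "specialized")
    clf.train("EXPTIME complete problems and algorithmic complexity", "specialized")
    print(clf.classify("chocolate cookies for kids"))
```

Adding a level such as average or hard only means training with one more label; the classification code stays the same.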
Before being able to classify a text, the Bayesian classifier needs to be trained. It needs a corpus of texts for every level of understandability wanted. This leads to three problems. First, the classifier will be unable to classify a word if it has not seen it before, for example a newly coined term or a very technical term. Second, it is language specific. For example, the term “deja vu” is very common in French (trust me on this), whereas “deja vu” is not that common in English (at least for a 6th grade kid). Finally, creating text corpora is a huge burden.
Conclusion
Understandability analysis is something that can be very handy for many people, from writers to bloggers to advertisers, but unlike readability criteria, there is no simple means to compute an index. Although algorithms and principles are available as a foundation, it will be some time before such an analysis is available.
Search engines can also use understandability analysis: wouldn't it be nice to be able to restrict a search to specialized articles or kids' articles?

