Back in the early 1900s, electrical engineers were focused on a problem: how can we send more information, faster? This was, of course, of major interest to the military, as it allowed for faster response and better coordination.
If you wanted to send information across long distances you did it using a telegraph and Morse code. A message was encoded as a series of beeps and, on the other end, a receiver would listen to those beeps and decode them. If you had two very good telegraphers you could transmit 10 to 15 words a minute.
Not very fast, especially when 5 of those words aren't essential to the information in the message itself:
REGRET TO INFORM YOU YOUR BARN BURNED DOWN DURING THE NIGHT | YOUR COWS ARE SAVED | THE PIGS WERE LOST
Sending information from one point to another involves a few terms we need to get to know:
In the message above, the receiver is informed that their barn burned down, something they didn't know before the message arrived. Here's another message that might follow:
WEATHER CLEAR TODAY IN THE 70s WITH SLIGHT CHANCE OF RAIN | CLEANUP SHOULD BE CARRIED WITHOUT PROBLEM
These two messages contain roughly the same number of characters, but one message is a bit more impactful, or surprising, than the other. Learning that your barn burned down has much more impact (and therefore more information) than a mundane weather report. But how much more?
If you lived in the early 1900s and stood next to your telegraph, staring at it, you would receive a number of messages:
Messages don't just happen, they are "pulled from the chaos" of all possible things that could happen. This is the main definition of a message that we'll be working with from here on out: a message is one possibility, pulled from the set of everything that could have happened.
It turns out that this is the key to quantifying information.
Entropy is the "degree of randomness" of a system. If we're talking about a binary system (1s and 0s) there is very little entropy: something can be a 1 or a 0. If we're talking about a message in Morse code there's a bit more entropy, because we have 26 uppercase letters, 10 digits, and a space.
If the message is in the English language, there are roughly 240,000 possible words. We can pare that down a bit to "common words", which brings it to more like 20,000 or 30,000. Either way, the degree of randomness tells us a lot.
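To make "degree of randomness" a bit more concrete, here's a minimal sketch in Python. It assumes every symbol is equally likely and uses 30,000 as the "common words" count from above, then counts how many distinct 5-symbol messages each system could produce:

```python
# How many distinct 5-symbol messages can each system produce?
alphabets = {
    "binary digits (0/1)": 2,
    "Morse character set (26 letters, 10 digits, a space)": 37,
    "common English words": 30_000,
}

for name, size in alphabets.items():
    # With k equally likely symbols, a 5-symbol message has k^5 possibilities
    print(f"{name}: {size**5:,} possible 5-symbol messages")
```

The bigger the pool of possibilities, the more "chaos" a single message is pulled from.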
The idea is a simple one: the greater the entropy, the greater the information a message contains. This only works, however, if we separate the meaning of a message from the information it contains. Yes, it's sad if your barn burns down! It's the chance of you reading those words, however, that defines the information.
Fire, cows saved, pigs lost.
Those words, in that order, have a very low probability and thus a high amount of information. Ralph Hartley, one of the first engineers to try to quantify the transmission of information, gave us the very first equation to express this idea:
H(M) = log |M|
The entropy, H, of a message M is equal to the log of the number of all possible messages from that system, |M|. In the case of our barn burning down, "all possible messages" would be the 240,000 words in the English language.
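As a minimal sketch, Hartley's equation is easy to put into code. The base of the logarithm is a choice; base 10 is what gives the "hartley" unit that shows up below:

```python
import math

def hartley_information(num_possible_messages: int, base: float = 10) -> float:
    """Hartley's H(M) = log|M|: the log of the number of possible messages."""
    return math.log(num_possible_messages, base)

print(hartley_information(2))        # one binary digit:    ~0.30 hartleys
print(hartley_information(37))       # one Morse character: ~1.57 hartleys
print(hartley_information(240_000))  # one English word:    ~5.38 hartleys
```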
To calculate the entropy of this message we add up the entropy of each word, which is the same as multiplying the entropy of a single word by the number of words. If I tell you your barn burned down, that's 4 words, or 4 individual messages, that make up the whole:
H("YOUR BARN BURNED DOWN") = log(240000) * 4
That gives us "21 Hartleys" of information, a unit of measure you've probably not heard before. Me neither. But that's not really the weird thing here.
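As a quick sanity check on that number (assuming base-10 logarithms, which is what the hartley unit implies):

```python
import math

word_pool = 240_000     # possible English words each word is drawn from
message_length = 4      # "YOUR BARN BURNED DOWN"

# Independent choices multiply the possibilities, so their logs add:
# H = 4 * log10(240,000)
h = message_length * math.log10(word_pool)
print(f"{h:.1f} hartleys")   # ~21.5 hartleys
```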
We have a bigger problem, which is that I could have said "THE SKY IS BLUE" and, using Hartley's equation, that would contain the exact same amount of information as "YOUR BARN BURNED DOWN", which is absolutely not the case.
This is where we get to meet Claude Shannon again, who expanded on Hartley's idea, taking into account the probability of each message. To Shannon, the amount of information in a message is inversely related to its probability - something Shannon called the "surprise" of a message.
That, Shannon reasoned, is the true measure of information and he calculated it using a simple probability equation:
s(M) = log2(1/p(M))
The surprise, s, of a given message M is equal to the log (base 2) of 1/p(M), where p(M) is the probability of that message. Why log base 2? Because Shannon was working in the world of digital communication, where messages are sent as 1s and 0s, and the binary digit is the natural unit for thinking about (and quantifying) information.
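Here's a minimal sketch of that surprise function. The probabilities below are invented purely for illustration, since we don't actually know how likely either message is:

```python
import math

def surprise(p: float) -> float:
    """Shannon's surprise: s(M) = log2(1 / p(M)), measured in bits."""
    return math.log2(1 / p)

# Made-up probabilities, just to show the inverse relationship
print(surprise(0.5))     # "THE SKY IS BLUE"       -> 1.0 bit
print(surprise(0.001))   # "YOUR BARN BURNED DOWN" -> ~10.0 bits
```

The rarer the message, the bigger the surprise, and the more information it carries.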
So, the surprise of the message "YOUR BARN BURNED DOWN" is directly tied to the probability of each of those words coming in. "YOUR" is pretty normal, "BARN" isn't all that common but following "YOUR" it's not unexpected. "BURNED" is extremely unexpected but, once again, the word "DOWN" is not, because it follows the word "BURNED".
To Shannon, the words "YOUR" and "DOWN" would be redundant in this message and could likely be stripped out, conveying close to the same idea: "BARN BURNED". Not exactly the same, as there is a bit of ambiguity, but this is a very important breakthrough!
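To see why "YOUR" and "DOWN" are nearly free to drop, here's a sketch that scores each word by its surprise. The per-word probabilities are hypothetical (real values would have to come from some model of the language); the point is only that a word which is nearly certain, given what came before it, contributes almost no information:

```python
import math

def surprise(p: float) -> float:
    return math.log2(1 / p)

# Hypothetical probability of each word *given the words before it*
word_probs = {
    "YOUR":   0.20,    # a common way for a telegram to start
    "BARN":   0.02,    # uncommon, but plausible after "YOUR"
    "BURNED": 0.001,   # very unexpected
    "DOWN":   0.90,    # nearly certain once you've read "BURNED"
}

for word, p in word_probs.items():
    print(f"{word:>6}: {surprise(p):5.2f} bits")

# "BARN" and "BURNED" carry almost all of the information;
# "YOUR" and "DOWN" are mostly redundant.
```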
Shannon gave us a way to think about shortening a message without losing too much information. Shorter messages mean more efficient transmission! Brilliant!
This led to Shannon's entropy equation, which is one of the most important equations ever produced:

H(M) = Σ p(m) * log2(1/p(m))

where the sum runs over every possible message m. In other words, entropy is the average surprise of a source: each message's surprise, weighted by how likely that message is to show up.
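A minimal sketch of that calculation, using an invented two-message source just to show how the pieces fit together:

```python
import math

def entropy(probabilities):
    """Shannon entropy: the probability-weighted average surprise of a source."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# A made-up source: "ALL QUIET" 99% of the time, "BARN BURNED" 1% of the time
print(entropy([0.99, 0.01]))   # ~0.08 bits per message (highly predictable)
print(entropy([0.5, 0.5]))     # 1.0 bit per message (maximally uncertain)
```

A predictable source has low entropy, and that is exactly what makes its messages compressible.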
If all of this has seemed academic up until now, this is where it all gets real. We use data compression every day, from GZIP on web pages to image compression in our text messages. It all started right here, with Claude Shannon's realization that words could be stripped from a message without losing much of its information.
We'll get to that in the next video.