Information Theory for Dummies

  • Aggressive
  • Amazed
  • Amused
  • Angelic
  • Angry
  • Artistic
  • Asleep
  • Bashful
  • Blah
  • Bored
  • Breezy
  • Brooding
  • Busy
  • Buzzed
  • Chatty
  • Cheeky
  • Cheerful
  • Cloud 9
  • Cold
  • Cold Turkey
  • Confused
  • Cool
  • Crappy
  • Curious
  • Cynical
  • Daring
  • Dead
  • Depressed
  • Devilish
  • Doh
  • Doubtful
  • Drunk
  • Energetic
  • Fiendish
  • Fine
  • Flirty
  • Gloomy
  • Goofy
  • Grumpy
  • Happy
  • Hot
  • Hung Over
  • In Love
  • In Pain
  • Innocent
  • Inspired
  • Lonely
  • Lurking
  • Mellow
  • Mischievious
  • Nerdy
  • None
  • Not Worthy
  • Paranoid
  • Pensive
  • Psychedelic
  • Question
  • Relaxed
  • ROFLMAO
  • Sad
  • Scared
  • Shocked
  • Sick
  • Sleepy
  • Sneaky
  • Snobbish
  • Spaced
  • Stressed
  • Sunshine
  • Sweet Tooth
  • Thinking
  • Tired
  • Twisted
  • Vegged Out
  • Worried
  • Yee Haw
  • Results 1 to 6 of 6
    1. #1
      FreezBee's Avatar
      FreezBee is offline Blu Ray Laser Phaser
      Asleep
       
      Join Date
      February 8th, 2006
      Location
      Copenhagen
      Posts
      32,406
      Male - churchburner
      Mentioned
      0 Post(s)

      Talking Information Theory for Dummies

      As a service to those who may have problems following the arguments in certain threads in this forum, here is a few bits of information theory of relevance.

      Assume a (discreet) probability space L and an event E in L with probability P(E). The information associated with E is then

      I(E) = -log(P(E)) for all E in L

      where "log" refers to the binary logarithm. The reason for the use of the binary logarithm can be given as follows. Assume that we are to guess a (positive, whole) number between 1 and 128, both inclusive. Our probability space here consists of an event for each of the 64 possible number, namely the choice of that number to be the one to guess for. Assuming all numbers to be equally likely and calling the number chosen x we have

      I(x = N) = -log(1/128) = 7 for all N in {1, 2, ..., 128}

      We might start out guessing at 1, then 2, and so on, or some other scheme with guessing at one possible number at a time. With such a scheme, the minimum number of guesses required is 1 (if we are right in the first try), the maximum number of guesses required is 127 (if we guess wrong for the 127th time, we have only 1 choice left and know it's the right one), and the average number of guesses required is 64. We can minimize the maximum number of guesses required by bisecting the remaining choices before each guess - bisecting, because we are only allowed a simple yes/no-question. Therefore the binary logarithm.

      In return, by the same token - it's the maximum number of guesses required - we can say that I(E) is the information content of the occurence of E. Note here that a smaller probability corresponds to a higher information content, and vice versa. In particular, if an event is certain to occur (probability = 1), then there is an information content of 0 associated with it (-log(1) = 0).

      This last is of importance in ID (Intelligent Design) theory. In a deterministic world model, all events that will occur are certain to occur, so no new information can be added. That's the claim. However, the problem is, that probabilities and information are not the same. A scientist observing an event, even if that event is fully deterministic, can gain information, because the scientist might not beforehand have had enough knowledge to predict exactly what event would occur. The scientist might have had a qualified estimation of a certain number of possible (as in not excluded by his then current knowledge) events and have assigned estimated probabilities to them. Which event actually happened to occur did increase the information that the scientist had, but that is a posteriori information, not a priori information. This is where IDist go wrong, they try to treat information as a given substance similar to matter and energy, but it doesn't work.

      Say I was to compute the 100th prime number. It's 100% certain which number I end up with (assuming I do my calculations correctly), but that does not mean that the answer provides me with no information.

      Talking about this, let's assume we are to write a computer program to calculate prime numbers up to, say, the 10,000th. One way would be to precalculate them somehow and then enter them into a table, the program then simply makes a table lookup each time it's asked for the Nth prime number. Another way is to write an algorithm that will compute the Nth prime, for instance by every time computing all the prime numbers up to and including the Nth, and for each candidate diving that number by all the numbers smaller than or equal to its square root. As of now, there is no other known algorithmic way of computing prime numbers. The first approach takes up some space in memory, but it's very quick to calculate the primes, and the speed is independent of N. The second approach takes up very little space in memory, but it's very slow, and it becomes as N increases. Which approach is the most intelligent?

      We'll not answer that question, but look at these two programs from another angle. The first program requires at least as many bits in its representation (including the table of prime numbers) as its possible output (here only counting a prime number the first time it's output), whereas the second program requires very few bits in its representation, especially compared with its output. In ID terminology only the second program is a specification, since a specification must be shorter thatn what it specifies. Say that what is specified requires T bits, and the specification requires S bits, if S + 500 < T, then we have a complex specification, otherwise not.

      Why exactly this number 500? Well, first we notice it's information, so it must correspond to a probability p such that 500 = -log(p). We calculate p to be 10e-150 (ignoring rounding issues as always in the exact sciences), and what's so sacred about this number? It's William Dembski's UPB (Universal Probability Bound), the lowest probability we ever need to consider, and it comes around like this: the number of elementary particles in the universe is 10e80 (is that so certain? we might ask), the maximum possible number of elementary particle transitions (the inverse of the Planck time) per second is 10e45 (this is assuming time to be quantified rather than continuous), and the number of seconds in a billion times the current age of the universe is 10e25 (Dembski is an Old Earther!). Multiplying together we get

      1/UPB = 10e80 x 10e45 x 10e25 = 10e150

      Ok, but that is of course only of any relevance assuming that all events are equally probable!

      Dembski also distinguishes between replicational ressources, the relevant (as determined by whom?) ways an event can occur, and specificational ressources, the relevant (as determined by whom) ways an event can be specified. These together comprise the probabilistic ressources. Let's assume again an event E, which lies within a region R of events. We will call R the rejection region of a hypothesis H, if the probability P(R | H) < a, where a, the significance level, is usually chosen to be 0.1, 0.05, or 0.01. Dembski suggest the following value: α = ˝ ÷ (RR × SR), where RR = #replicational ressources, and SR = #specificational ressources.

      Ok, so now you should be a bit better informed about, what Jorge is referring to


      - FreezBee
      From darkness into light
      Like icy shards from the broken mirror within
      Melting in the tears from the stars in your eyes
      Shining still brighter, still fainter through the darkness
      The love between you and me, a trace of dawn

    2. #2
      rach12's Avatar
      rach12 is offline Big Ba-Dah Big Boom
      ---
       
      Join Date
      May 6th, 2003
      Location
      Colorado
      Posts
      959
      Female - Atheist
      Mentioned
      0 Post(s)

      Re: Information Theory for Dummies

      What's "log?"








      Kidding.


      Thanks, that was helpful(???). Unfortunately, now my head aches so I'm off to Tylenol the pain.

    3. #3
      oxmixmudd's Avatar
      oxmixmudd is offline tWebber
      Nerdy
       
      Join Date
      August 23rd, 2005
      Location
      southeast
      Posts
      7,774
      Male - Christianity
      Mentioned
      0 Post(s)

      Re: Information Theory for Dummies

      Quote Originally posted by FreezBee
      As a service to those who may have problems following the arguments in certain threads in this forum, here is a few bits of information theory of relevance.

      Assume a (discreet) probability space L and an event E in L with probability P(E). The information associated with E is then

      I(E) = -log(P(E)) for all E in L

      where "log" refers to the binary logarithm. The reason for the use of the binary logarithm can be given as follows. Assume that we are to guess a (positive, whole) number between 1 and 128, both inclusive. Our probability space here consists of an event for each of the 64 possible number, namely the choice of that number to be the one to guess for. Assuming all numbers to be equally likely and calling the number chosen x we have

      I(x = N) = -log(1/128) = 7 for all N in {1, 2, ..., 128}

      We might start out guessing at 1, then 2, and so on, or some other scheme with guessing at one possible number at a time. With such a scheme, the minimum number of guesses required is 1 (if we are right in the first try), the maximum number of guesses required is 127 (if we guess wrong for the 127th time, we have only 1 choice left and know it's the right one), and the average number of guesses required is 64. We can minimize the maximum number of guesses required by bisecting the remaining choices before each guess - bisecting, because we are only allowed a simple yes/no-question. Therefore the binary logarithm.

      In return, by the same token - it's the maximum number of guesses required - we can say that I(E) is the information content of the occurence of E. Note here that a smaller probability corresponds to a higher information content, and vice versa. In particular, if an event is certain to occur (probability = 1), then there is an information content of 0 associated with it (-log(1) = 0).

      This last is of importance in ID (Intelligent Design) theory. In a deterministic world model, all events that will occur are certain to occur, so no new information can be added. That's the claim. However, the problem is, that probabilities and information are not the same. A scientist observing an event, even if that event is fully deterministic, can gain information, because the scientist might not beforehand have had enough knowledge to predict exactly what event would occur. The scientist might have had a qualified estimation of a certain number of possible (as in not excluded by his then current knowledge) events and have assigned estimated probabilities to them. Which event actually happened to occur did increase the information that the scientist had, but that is a posteriori information, not a priori information. This is where IDist go wrong, they try to treat information as a given substance similar to matter and energy, but it doesn't work.

      Say I was to compute the 100th prime number. It's 100% certain which number I end up with (assuming I do my calculations correctly), but that does not mean that the answer provides me with no information.

      Talking about this, let's assume we are to write a computer program to calculate prime numbers up to, say, the 10,000th. One way would be to precalculate them somehow and then enter them into a table, the program then simply makes a table lookup each time it's asked for the Nth prime number. Another way is to write an algorithm that will compute the Nth prime, for instance by every time computing all the prime numbers up to and including the Nth, and for each candidate diving that number by all the numbers smaller than or equal to its square root. As of now, there is no other known algorithmic way of computing prime numbers. The first approach takes up some space in memory, but it's very quick to calculate the primes, and the speed is independent of N. The second approach takes up very little space in memory, but it's very slow, and it becomes as N increases. Which approach is the most intelligent?

      We'll not answer that question, but look at these two programs from another angle. The first program requires at least as many bits in its representation (including the table of prime numbers) as its possible output (here only counting a prime number the first time it's output), whereas the second program requires very few bits in its representation, especially compared with its output. In ID terminology only the second program is a specification, since a specification must be shorter thatn what it specifies. Say that what is specified requires T bits, and the specification requires S bits, if S + 500 < T, then we have a complex specification, otherwise not.

      Why exactly this number 500? Well, first we notice it's information, so it must correspond to a probability p such that 500 = -log(p). We calculate p to be 10e-150 (ignoring rounding issues as always in the exact sciences), and what's so sacred about this number? It's William Dembski's UPB (Universal Probability Bound), the lowest probability we ever need to consider, and it comes around like this: the number of elementary particles in the universe is 10e80 (is that so certain? we might ask), the maximum possible number of elementary particle transitions (the inverse of the Planck time) per second is 10e45 (this is assuming time to be quantified rather than continuous), and the number of seconds in a billion times the current age of the universe is 10e25 (Dembski is an Old Earther!). Multiplying together we get

      1/UPB = 10e80 x 10e45 x 10e25 = 10e150

      Ok, but that is of course only of any relevance assuming that all events are equally probable!

      Dembski also distinguishes between replicational ressources, the relevant (as determined by whom?) ways an event can occur, and specificational ressources, the relevant (as determined by whom) ways an event can be specified. These together comprise the probabilistic ressources. Let's assume again an event E, which lies within a region R of events. We will call R the rejection region of a hypothesis H, if the probability P(R | H) < a, where a, the significance level, is usually chosen to be 0.1, 0.05, or 0.01. Dembski suggest the following value: α = ˝ ÷ (RR × SR), where RR = #replicational ressources, and SR = #specificational ressources.

      Ok, so now you should be a bit better informed about, what Jorge is referring to


      - FreezBee
      Thanks much! But I am trying to decide if it makes sense that all forms of information can be ordered in such a way that binary search is a resonable way to minimally chose a specific instance.

      Jim

    4. #4
      FreezBee's Avatar
      FreezBee is offline Blu Ray Laser Phaser
      Asleep
       
      Join Date
      February 8th, 2006
      Location
      Copenhagen
      Posts
      32,406
      Male - churchburner
      Mentioned
      0 Post(s)

      Re: Information Theory for Dummies

      Quote Originally posted by oxmixmudd
      But I am trying to decide if it makes sense that all forms of information can be ordered in such a way that binary search is a resonable way to minimally chose a specific instance.
      Given that you can only ask yes/no-questions, a binary search will minimize the maximum number of guesses you have - the minimum will be the same or one less. In the example given, where you have to guess a number in the range 1 to 128, you can bisect exactly 7 times, and that will be the number of guesses you'll need independent of the correct number. If you particularly like the number 52, you can take a shot at that, and maybe it's the correct number, whereby you got it in one try, so binary search will not give you the minimum minimum, just the maximum minimum. Of course, when using binary search in a situation, where you do not have a uniform distribution, you need to balance things out such that you split into 50% vs. 50%. Say that you have to guess a number between 1 and 10, but you know that

      P(1) = P(2) = P(3) = P(4) = 5%,
      P(4) = P(5) = P(6) = P(7) = 10%, and
      P(7) = P(8) = P(9) = P(10) = 15%.

      In this situation, you would not split into {1, ..., 5} and {6, ..., 10}, but into e.g. {1, 8, 9, 10} and {2, ..., 7}, both groups having a combined probability of 50%.


      Quote Originally posted by rach12
      Thanks, that was helpful(???). Unfortunately, now my head aches so I'm off to Tylenol the pain.
      Well, all I can say that I hope it didn't make your head ache 10150 times as many as reading one of Jorge's posts would have done.


      - FreezBee
      From darkness into light
      Like icy shards from the broken mirror within
      Melting in the tears from the stars in your eyes
      Shining still brighter, still fainter through the darkness
      The love between you and me, a trace of dawn

    5. #5
      oxmixmudd's Avatar
      oxmixmudd is offline tWebber
      Nerdy
       
      Join Date
      August 23rd, 2005
      Location
      southeast
      Posts
      7,774
      Male - Christianity
      Mentioned
      0 Post(s)

      Re: Information Theory for Dummies

      Quote Originally posted by FreezBee
      Given that you can only ask yes/no-questions, a binary search will minimize the maximum number of guesses you have - the minimum will be the same or one less. In the example given, where you have to guess a number in the range 1 to 128, you can bisect exactly 7 times, and that will be the number of guesses you'll need independent of the correct number. If you particularly like the number 52, you can take a shot at that, and maybe it's the correct number, whereby you got it in one try, so binary search will not give you the minimum minimum, just the maximum minimum. Of course, when using binary search in a situation, where you do not have a uniform distribution, you need to balance things out such that you split into 50% vs. 50%. Say that you have to guess a number between 1 and 10, but you know that

      P(1) = P(2) = P(3) = P(4) = 5%,
      P(4) = P(5) = P(6) = P(7) = 10%, and
      P(7) = P(8) = P(9) = P(10) = 15%.
      Yes, but one must be able to group the instances and apply an order to them to do so. I was just wondering if one can always do that for information. Certainly any finite set of anything can be ordered, but does it make sense to define information complexity this way? Just pondering (probably out of ignorance). I don't know much about information theory, but it seems to me multiple layered encodings might be hard to group in a way that can be ordered. Further, the information content of a photograph goes far beyond the bits required to represent it (or does it). If I look at a photo of a mountain range, I will see colors, mountains, rock types, information about the kind of day it was etc. etc. Can all that really be reduced to a number like 21 (2 million pixels)?, wheras a photo of an empty, featureless space would come up similarly as it also would be represented by 2 million pixels. The probability of one string of bits over the other is the same, but the information content(and complexity) is indeed quite different.


      Jim

      ETA: of course, if all we are defining is the maximun complexity, one could account for that - but then what does that do for the design inference? If we only know a sets maximum possible complexity, but not its actual complexity, we can't really say much about whether it was designed.
      Last edited by oxmixmudd; February 28th 2006 at 01:31 PM.

    6. #6
      FreezBee's Avatar
      FreezBee is offline Blu Ray Laser Phaser
      Asleep
       
      Join Date
      February 8th, 2006
      Location
      Copenhagen
      Posts
      32,406
      Male - churchburner
      Mentioned
      0 Post(s)

      Re: Information Theory for Dummies

      Quote Originally posted by oxmixmudd
      Yes, but one must be able to group the instances and apply an order to them to do so. I was just wondering if one can always do that for information. Certainly any finite set of anything can be ordered, but does it make sense to define information complexity this way?
      You are right that we are not always fortunate enough to know the probability distribution, and that's one of the flaws with ID - the assumption is a random distribution, where we do not know better. It may be ok as an assumption, but you can't use an assumption for final proof! For instance, molecules do not attach to each other randomly.

      Quote Originally posted by oxmixmudd
      I don't know much about information theory, but it seems to me multiple layered encodings might be hard to group in a way that can be ordered. Further, the information content of a photograph goes far beyond the bits required to represent it (or does it). If I look at a photo of a mountain range, I will see colors, mountains, rock types, information about the kind of day it was etc. etc. Can all that really be reduced to a number like 21 (2 million pixels)?, wheras a photo of an empty, featureless space would come up similarly as it also would be represented by 2 million pixels.
      If you have a photo, a specification of the information in that photo would be some string of bits much shorter than the number of grains in the photo. For the empty space, the phrase "empty space" would do, but for the photo of the mountain range, you could probably not come up with a loss-less specification. A jpeg encoding might not loose much information - the eye might not be able to easily discern the difference.

      Quote Originally posted by oxmixmudd
      The probability of one string of bits over the other is the same, but the information content(and complexity) is indeed quite different.
      True, but the main point in ID is whether you can come up with a specification that is more than 500 bits shorter than what is specified. IDist say that that's not possible by natural methods, it requires an intelligence to do that.

      Quote Originally posted by oxmixmudd
      ETA: of course, if all we are defining is the maximun complexity, one could account for that - but then what does that do for the design inference? If we only know a sets maximum possible complexity, but not its actual complexity, we can't really say much about whether it was designed.
      No, you are quite right here


      - FreezBee
      From darkness into light
      Like icy shards from the broken mirror within
      Melting in the tears from the stars in your eyes
      Shining still brighter, still fainter through the darkness
      The love between you and me, a trace of dawn

    Similar Threads

    1. A Statistical Approach to Multi-Level Information Theory
      By FreezBee in forum Natural Science 301
      Replies: 14
      Last Post: August 11th 2009, 12:45 PM
    2. Information theory and evolution
      By SteveF in forum Natural Science 301
      Replies: 12
      Last Post: January 10th 2009, 11:53 AM
    3. For the benefit of Jorge: Information theory for dummies, take 2
      By FreezBee in forum Natural Science 301
      Replies: 57
      Last Post: April 11th 2007, 01:02 PM
    4. An actual example of Information Theory...
      By TheGreenMan in forum Natural Science 301
      Replies: 4
      Last Post: March 31st 2006, 04:32 PM
    5. Cladistics and Information Theory
      By FreezBee in forum Natural Science 301
      Replies: 6
      Last Post: February 22nd 2006, 08:27 AM

    Posting Permissions

    • You may not post new threads
    • You may not post replies
    • You may not post attachments
    • You may not edit your posts
    •