Bits and Bytes

The information in disk files, in the computer's main memory, in the central processing unit (CPU), or anywhere else in the computer system is represented Digitally.

This form of information is called Digital information as opposed to Analog information. Digital information can only take on discrete values (which can be associated with whole numbers: 1, 2, 3, ...); while analog information can take any continuous value within a range (1.52, 1/3, ...).

An example of these two types of information is illustrated by the difference in representation between a digital watch and the classical watch with hands. The classical watch with hands is analog because it can represent the time in a continuous manner. The hour, minute and second hands can be in any position around the circle which is the face of the clock.

In a digital watch, instead of a second hand, two digits are displayed to give the seconds portion of the time. The analog seconds indication, the position of the second hand, can take any of the infinitesimally different angles or positions around the circle, while the digital watch can only take one of the 60 quantized or discrete values running from 00 to 59. If greater precision is required in the digital watch, one can always have tenths or hundredths or even finer divisions of the second displayed.
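
The quantization just described can be sketched in C. This is only an illustration: the function name seconds_from_angle and the choice of measuring the hand's angle in degrees clockwise from the 12 o'clock position are assumptions made for this example.

```c
/* Hypothetical helper: map the angle of an analog second hand
   (degrees clockwise from 12 o'clock, 0 <= angle < 360) to the
   value a digital watch would display (0 to 59). */
int seconds_from_angle(double angle_degrees)
{
    /* 360 degrees / 60 seconds = 6 degrees per displayed second.
       The cast discards the fraction, so every angle inside one
       6 degree slice collapses to the same discrete value. */
    return (int)(angle_degrees / 6.0);
}
```

Under these assumptions, both 90.0 and 91.7 degrees are displayed as 15: the continuous position is lost and only the discrete value remains.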

This is a very real advantage. Even though an analog signal or value can take on any magnitude in its range, it is subject to noise and other sources of error which significantly limit the attainable accuracy. A Digital value can be as accurate as is needed. For example, 20 digits of accuracy were needed for the computers to calculate the trajectory and the timing and direction of the engine firings for the astronauts when they went to the moon. Anything less would have meant missing the aimed-for spot, or worse, crashing or missing the moon altogether!

Returning to the information in the computer, it is digital, but, instead of being in base 10 like we usually represent numbers (or base 60 in a digital watch), it is represented in base 2 in most modern computers. This means that the smallest form of information is what is called a Bit, which is a contraction of the phrase 'Binary-Digit'. A Bit can only take on the values of:

0 1

instead of being a (Decimal) Digit which takes the values:

0 1 2 3 4 5 6 7 8 9

The reasons why a Binary rather than a Decimal representation is the preferred form are technical and beyond the scope of this tutorial. But essentially, it is easier to build circuits where each circuit element has only to distinguish two states, OFF and ON, rather than ten different states.

The meaning of the two different digital states can be any two (usually opposite - but not necessarily so) meanings, for example:

OFF	ON
NO	YES
0	1
COLD	HOT
LEFT	RIGHT
DOWN	UP
RED	BLUE

What is usually important is that a transition between these two states can be used to represent some event, or one part of a larger change in the stored information.

Everything in a computer, including the instructions for a program and the text you type into the computer (e.g. this tutorial), is stored or represented as Bits.

The Bit is the smallest quantity of information, but it is too small for working with most quantities; hence Bits are usually used in groups of fixed size. One of the commonest of such groups is called the Byte. There are usually 8 bits in a Byte (8 is 2 to the third power, that is: 8 = 2 * 2 * 2), although there are exceptions which are becoming rarer and rarer - there are (or were) computers with 9 bit bytes. The bits in a byte are sequentially ordered, and can be numbered from the right or the left - usually starting with zero.

     most significant bit        B Y T E        least significant bit

     bit 7   bit 6   bit 5   bit 4   bit 3   bit 2   bit 1   bit 0
      128      64      32      16       8       4       2       1
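
The numbering and weights of the bits in a byte can be explored with a short C sketch. The function names bit_weight and bit_is_set are hypothetical, chosen for this example, and the bits are assumed to be numbered from the right starting with zero.

```c
/* Weight of bit number n, where bit 0 is the least significant bit.
   Shifting 1 left by n places is the same as raising 2 to the power n. */
unsigned int bit_weight(unsigned int n)
{
    return 1u << n;
}

/* Returns 1 if bit number n of the byte is ON, 0 if it is OFF. */
int bit_is_set(unsigned char byte, unsigned int n)
{
    /* Move the wanted bit to position 0, then discard all other bits. */
    return (byte >> n) & 1;
}
```

For the byte with value 129 (128 + 1), bit_is_set reports 1 for bit 7 and bit 0, and 0 for every bit in between.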

For reasons of simplicity in the design, bits are usually grouped into larger sequentially ordered collections whose sizes are powers of 2. Bytes, in turn, are usually grouped into collections which are also powers of two: 2, 4, 8, 16, 32, etc.

In modern Positional Notation, the value of a Digit (Bit) in a Decimal (Binary) number depends upon its position within the number. For example, in the decimal number 4307, the 3 is implicitly multiplied by 100. Note the importance of the zero to its right to preserve this fact. Each Digit is implicitly multiplied by a power of the base, i.e.

     7 * 1     [   1 = 10 to the power 0]
     0 * 10    [  10 = 10 to the power 1]
     3 * 100   [ 100 = 10 to the power 2]
     4 * 1000  [1000 = 10 to the power 3]

Thus, in a Binary number, the least significant Binary-Digit, or Bit, has a weight of 1 (as in all bases) or 2 to the power 0, the next has a weight of 2 (2 to the power 1), the next has a weight of 4 (2 to the power 2), the next 8, then 16, 32, 64, 128 ...
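
As a small illustration of these positional weights, the following C sketch converts a written Binary number to its value. The function name bits_to_value, and the convention of passing the number as a string of '0' and '1' characters with the most significant bit first, are assumptions made for this example only.

```c
#include <stddef.h>

/* Hypothetical helper: convert a string of '0'/'1' characters
   (most significant bit first) to its numeric value. */
unsigned int bits_to_value(const char *bits)
{
    unsigned int value = 0;
    size_t i;

    for (i = 0; bits[i] != '\0'; i++) {
        /* Stepping one position to the right doubles the weight of
           every bit seen so far, just as multiplying by 10 does
           for the digits of a decimal number. */
        value = value * 2 + (unsigned int)(bits[i] - '0');
    }
    return value;
}
```

For example, bits_to_value("1011") gives 8 + 2 + 1 = 11.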

For example, counting to 16 in both Decimal and Binary looks like the following:

     Decimal  Binary        4 3 2 1 0 (power of two)
     -------  ------        ------------------------
        0         0      =          0
        1         1      =          1
        2        10      =        2
        3        11      =        2+1
        4       100      =      4
        5       101      =      4+  1
        6       110      =      4+2
        7       111      =      4+2+1
        8      1000      =    8
        9      1001      =    8+    1
       10      1010      =    8+  2
       11      1011      =    8+  2+1
       12      1100      =    8+4
       13      1101      =    8+4+  1
       14      1110      =    8+4+2
       15      1111      =    8+4+2+1
       16     10000      = 16
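
The first two columns of this table can be reproduced with a short C sketch. The names to_binary, print_counting_table and BINARY_BUF_SIZE are hypothetical, introduced only for this illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Room for every bit of an unsigned int plus the terminating '\0'. */
#define BINARY_BUF_SIZE (sizeof(unsigned int) * CHAR_BIT + 1)

/* Hypothetical helper: write the binary digits of n into buf
   (most significant bit first, no leading zeros) and return buf.
   buf must hold at least BINARY_BUF_SIZE characters. */
char *to_binary(unsigned int n, char *buf)
{
    char tmp[sizeof(unsigned int) * CHAR_BIT];
    size_t count = 0;
    size_t i;

    /* Collect the bits from least significant to most significant. */
    do {
        tmp[count++] = (char)('0' + (n % 2));
        n /= 2;
    } while (n != 0);

    /* Reverse them so the most significant bit comes first. */
    for (i = 0; i < count; i++) {
        buf[i] = tmp[count - 1 - i];
    }
    buf[count] = '\0';
    return buf;
}

/* Print the Decimal and Binary columns for 0 up to and including limit. */
void print_counting_table(unsigned int limit)
{
    char buf[BINARY_BUF_SIZE];
    unsigned int n;

    for (n = 0; n <= limit; n++) {
        printf("%7u  %8s\n", n, to_binary(n, buf));
    }
}
```

Calling print_counting_table(16) from a main function prints one line per number from 0 to 16.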

It is from the Binary Number system that we get the term 'K' (pronounced KAY), which is 2 to the power 10, or 1024. Thus 16K is 16*1024 = 16384. It is a good approximation (i.e. within 2.4 per cent) to take the value of '1K' to be about 1000. Memory size is usually given in multiples of:

   1024 bytes, which are called Kilobytes (KB)
   1024*1024 bytes, which are called Megabytes (MB)
   1024*1024*1024 bytes, which are called Gigabytes (GB)
   1024*1024*1024*1024 bytes, which are called Terabytes (TB)
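
These sizes can be checked with a few lines of C. The function name power_of_1024 is hypothetical, and the sketch assumes a compiler that supports the unsigned long long type, since a Terabyte does not fit in 32 bits.

```c
/* Hypothetical helper: 1024 multiplied by itself n times, so that
   n = 1 gives a Kilobyte, 2 a Megabyte, 3 a Gigabyte, 4 a Terabyte. */
unsigned long long power_of_1024(unsigned int n)
{
    unsigned long long result = 1;

    while (n > 0) {
        result *= 1024ULL;
        n--;
    }
    return result;
}
```

With this helper, 16K is 16 * power_of_1024(1) = 16384, and power_of_1024(4) is 1099511627776 bytes.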

A byte is an ordered collection or string of 8 bits. As was mentioned earlier, 8 was chosen because it is a power of 2. Another reason 8 was chosen for the number of bits in a byte is that it gives a large enough numerical range to represent all the characters (letters, numbers, punctuation, etc.) in the computer's alphabet (see ASCII Codes below).

Each keystroke on the keyboard produces one byte of information which can be sent to or received by the computer. A byte is a number in the range from 0 to 255, although a single keystroke producing a character in the ASCII set has a value only in the range 0 to 127 (which needs only 7 bits).
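
The two ranges follow directly from the number of bits available, as this C sketch shows. The function name max_value_for_bits is hypothetical, and the sketch assumes the number of bits asked for is smaller than the width of an unsigned long.

```c
/* Hypothetical helper: the largest number that fits in the given
   number of bits, i.e. (2 to the power bits) minus 1.
   Assumes bits is smaller than the width of unsigned long. */
unsigned long max_value_for_bits(unsigned int bits)
{
    return (1UL << bits) - 1UL;
}
```

Here max_value_for_bits(8) is 255, the top of the byte range, and max_value_for_bits(7) is 127, the top of the ASCII range.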

The number of bits must be large enough to represent all of the possibilities of the:

       letters (upper and lower case)
       digits
       punctuation
       special characters (e.g. $ _ ~)
       control characters (e.g. RET, SP, TAB, FF, etc.)

Since each keystroke or letter is thus a number, if the numbers which represent the different letters in the alphabet are appropriately chosen, then putting things in alphabetical order is merely a matter of comparing numbers and making decisions based on the results of the numerical comparison. Thus to alphabetize the two names INDIA and EARTH, the computer would look at the numerical value for 'I' and the numerical value for 'E' in its internal storage, and finding that the number for 'E' is less than the number for 'I' conclude that EARTH comes alphabetically before INDIA.
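
This comparison can be sketched in C with the standard library function strcmp, which compares two strings character by character using their numerical values. The wrapper name comes_before is hypothetical, and the example assumes the ASCII codes and names written in the same case.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if name a comes before name b
   when the characters are compared by their numerical codes,
   and 0 otherwise. */
int comes_before(const char *a, const char *b)
{
    /* strcmp returns a negative number when the first differing
       character of a has a smaller code than that of b. */
    return strcmp(a, b) < 0;
}
```

With ASCII codes, comes_before("EARTH", "INDIA") is 1 because the code for 'E' (69) is less than the code for 'I' (73). Note that every upper case letter has a smaller code than every lower case letter, so mixed-case names need extra care.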

The set of these numbers and their associated characters has been standardized so that different manufacturers and programmers will have a common 'alphabet' to work with. This alphabet is called the ASCII set. ASCII stands for the "American Standard Code for Information Interchange". There are 128 characters in the ASCII computer alphabet.

Some of these characters are the normal alphabet (both upper and lower case), the digits, punctuation, and other special printable characters, plus some non-printable characters which are used for conveying control information to the computer and output devices (e.g. printers and displays).

Examples of these special codes are the ones for:

        Carriage Return,
        Space,
        (Horizontal) Tab,
        Form Feed (which causes the next character to be placed at
                  the top left corner of the next page)

plus some other control characters whose meaning depends upon the device, program or context using them. These special ASCII codes do not have a printed form like the letters. They normally cause the machine to take an action - e.g. go to the next line, or column, or TAB position, or next page. More precisely, they cause the next printed character to appear somewhere other than in the next column.

On most computer or terminal keyboards, in addition to the usual letters, numbers and punctuation, there are some special keys to enter these other ASCII codes. For example, many computer keyboards have a special key called the CTRL or Control key which is just like the SHIFT key in that it tells the computer to interpret the keys in a different way. Not every character in the ASCII character set can be entered on every keyboard. In fact there are very few keyboards that can enter all 128 ASCII codes with a single keystroke.

Note that a lower case 'a', an upper case 'A' (produced by holding the SHIFT key and typing the 'A' key), and Control-A (CTRL-A, sometimes written as '^A' and produced by holding the CTRL key while typing the 'A' key) are each considered to be a single keystroke, and all three are different from each other. Please note that '^A' usually means CTRL-A, which is different from typing the '^' key followed by the 'A' key!
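
The three different codes can be seen in a short C sketch. It relies on a common convention that is an assumption here, not something stated above: in ASCII, the code for a CTRL combination is obtained by keeping only the low five bits of the letter's code. The function name control_code is hypothetical.

```c
/* Hypothetical helper, assuming the ASCII character set: the code
   produced by holding CTRL while typing the given letter.
   Keeping only the low five bits maps 'A' (65) and 'a' (97)
   both to 1, which is the code for CTRL-A. */
int control_code(char letter)
{
    return letter & 0x1F;
}
```

Under this assumption 'a' is 97, 'A' is 65 and CTRL-A is 1, three distinct numbers; likewise control_code('R') is 18.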

When using a typewriter, people frequently use a lower case 'l' for the digit '1', sometimes interchange zero, '0', and capital 'O', and occasionally interchange '2' and 'Z'. To the computer all of these are absolutely distinct and the computer will not know or care that they look alike.

Some keyboards have arrow keys or other 'function' keys which may be special or unique to a particular computer. These keys when typed may generate one of the ASCII codes but on some machines they may generate a signal that only the given computer can receive and interpret. The generated information may be beyond the 7 bit ASCII set, i.e. in the range 128 to 255, or even beyond the 8 bit range (greater than 255), or such keys may generate multiple bytes. The latter is what happens on PC type computers - a special key yields two bytes, the first with all bits zero and the second a non ASCII code specifying which special key has been pressed.
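
A program reading such keystrokes has to tell the one-byte and two-byte cases apart. The following C sketch shows one way this might be done; the struct key_event, the function decode_key and the idea of reading the bytes from a buffer are all hypothetical, and only the zero-byte convention itself comes from the description above.

```c
#include <stddef.h>

/* Hypothetical record describing one key press. */
struct key_event {
    int is_special;      /* 1 for a special key, 0 for an ordinary byte */
    unsigned char code;  /* the character code, or the special-key code */
};

/* Decode one key press from the start of buf and report through
   *used how many bytes it occupied. Assumes the two-byte convention
   described above: a zero byte followed by a special-key code.
   The caller must ensure buf holds at least two bytes whenever
   buf[0] is zero. */
struct key_event decode_key(const unsigned char *buf, size_t *used)
{
    struct key_event ev;

    if (buf[0] == 0) {
        ev.is_special = 1;
        ev.code = buf[1];
        *used = 2;
    } else {
        ev.is_special = 0;
        ev.code = buf[0];
        *used = 1;
    }
    return ev;
}
```

The value returned through used tells the caller where the next keystroke begins in the buffer.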

The byte values between 128 and 255 were given a meaning by IBM for some graphics characters, accented letters and mathematical symbols. These codes have become a de facto standard because of the over 30 million PCs in the world and the need for printers to print the expected pattern of ink, but they are not supported so much any more because of the availability of many fonts and international character sets which use two bytes per character.

Some of the symbols for the characters whose codes are in the range 0 to 31 are shown for completeness only; their actual names (e.g. DC2) are of no importance here, except to note that 18 is the code for CTRL-R or ^R.

A SP (space) is not nothing - it separates the characters on either side of itself. It takes a byte for each space. Thus if you have unnecessary spaces in a file they increase the file size. Similarly, although a RET or FF does not usually appear as a symbol on the screen or printed page, it does cause the position of the next character to be on the next line or page. More will be said about these special characters when the lexical details of the 'C' language are described later.
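
That each space occupies a byte can be demonstrated with a small C sketch. The function name count_char is hypothetical, introduced only for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: count how many bytes of text hold the
   character wanted. */
size_t count_char(const char *text, char wanted)
{
    size_t n = 0;

    for (; *text != '\0'; text++) {
        if (*text == wanted) {
            n++;
        }
    }
    return n;
}
```

For the text "a  b c", count_char with ' ' as the wanted character gives 3, so three of its six bytes are spent on spaces.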