From Silicon to Code

The basis of computation is Binary

First we must define the fundamental basis of how information is processed in any modern computer. A computer is any device that processes information. The most basic unit of information we use today is the binary bit. A bit is actually the smallest possible unit of information. This is because a bit has exactly two states. Two states is the minimum state difference neccesary to convey any information. It's not possible to convey any information with something that only has one state because that would be a constant signal that never changes. Conveying information with more than two states is possible, but more complex. Thus, conveying information with two states is the simplest method. This information theory is why binary was selected as the base unit for information in modern computers. This also explains the etymology behind the word. The "Bi-" prefix meaning two and the suffix "-nary" comes from the Latin suffix "-narius" which became the english "-nary" which means "related to a certain number of items". In this case the "items" are states of information. Furthermore, binary was selected because of the ease of which it translates to the properties of electicity. Electrical switches and circuits have two states, on and off. Translating these states to binary is simple. Developing an electrical device that has more than two-state behavior becomes more complex. This is generally true for mechanical computers as well as computers that use other things to delineate state.

First beginnings woven into the fabric of time

Much of the inspiration for binary actually came from a textile machine. The Jacquard loom, invented by frenchman Joseph Marie Jacquard in 1801 used a system of vertical punchcard stacks and threaded needles to automatically knit patterns into textiles. The punchcard either allowed or blocked a threading needle from going through the card for each woven layer in the fabric pattern. This Loom was the inspiration for Charles Babbage's Analytical Engine. Charles Babbage was a 19th century inventor and computer scientist well known for his genius in mathematics and mechanics. Charles began working on the design of the Engine in 1834. He continued to refine the design until his death in 1871. While he was unable to build the physical device during his lifetime the idea was revolutionary and expanded on Jacquard's loom idea much further. Charles' engine borrowed the idea of punchcards representing data, but Charles expanded on this by designing the engine so that it could perform general mathematics rather than only modifying Loom patterns. Charles also separated the punchcards into two stacks, one for data, and one for the instructions in his machine. He intended for the punchcards to follow a program that could include steps, loops, and logical checks. This was far ahead of it's time because it would've allowed the machine to be reprogrammed.

The War period, the Enigma, and the loss and re-discovery of computing science

The evolution of these mechanical computers from Charles' general purpose design took a strange backwards turn after his death stopped further progress. Unfortunately Charles' design would go unnoticed for years afterwards until his work was re-discovered in the 1930s. In his 1800s era the mechanical limitations of contemporary technology did not allow him to build a general purpose Analytical Engine. He had no feasible way to make his design electronic either. The underlying goal of a general purpose programmable machine with punchcards did survive, and it inspired a new generation of scientists. The next computer science revolution would come during WW2 when Alan Turing and his team at Bletchley park designed his "Bombe" or "logic bomb" machine to decode Nazi Enigma encrpyted radio transmissions. His Bombe is widely known as the world's first computer. Technically, Polish computer scientists and human computers had also developed "Bombes" to decode earlier versions of the Enigma machine before and during the Nazi invasion of Poland. These brave scientists transmitted their findings to British intelligence before and during the war.

The Enigma Machine worked by encrypting messages for radio transmission in the German alphabet one letter at a time. It did so using various rotor and dial settings that would scramble the German plain text alphabet into an encrypted message where each German letter was scrambled to be a different German letter. The settings for the dials were sent securely in advance via couriers. These settings were closely guarded secrets and were ordered to be destroyed before capture. The rotor settings determined the encryption and decryption of the plain text German alphabet. This meant Enigma was a symmetric cipher, one that encrpyts and decrypts a message with the same settings. These rotor settings were known as the key, because igf you knew them, you could read the secret messages. The rotor settings list held by German officers was referred to as a "key list" because it contained a long list of secret rotor and dial settings. The keys were changed daily according to these secret lists. The Germans were wary of their secrets being leaked because they discovered the Polish work to decode them. The Germans changed their keys so frequently to make it impossible to test all the rotor combinations by hand. Due to the fighting it was hard to gain physical access to the key lists. The Allied forces decided to build logic machines in Britain to test Enigma setting combinations automatically. These machines would try to deduce the key for that day within 24 hours. Messages older than 24 hours lost their tactical relevance quickly. The Enigma machine had two major flaws which sped up the automated rotor setting guessing process. The first flaw was that the Engima machine could not encode a German letter to itself. This meant that any Engima rotor combination or "possible key" that tried to encrypt a German letter to itself was not the real key. Since the Enigma encoded one German letter to every other German letter, this reduced the possibilities significantly. The second major flaw is that german messages often contained patterns. These patterns were called "cribs" by the British. These patterns would be things like common German words or phrases used in transmissions, like "WETTER" which is german for "weather". These patterns were visible because lots of these transmissions included weather reports or other common phrases. By analyzing the pattern of the encrpyted transmissions and by using or guessing what "cribs" might be present, the Allies could eliminate possible rotor combinations (keys). Exploiting these flaws manually was still impossible because humans would still not be fast enough to test all the possible rotor combinations by hand even with shortcuts. Alan Turing had a genius idea to create his own Bombe to automate the guessing process. Alan's Bombe took an Enigma encrpyted message as input and it used a series of gears and wires to test Enigma keys. The test was simple, the machine would stop and "halt" if the rotor combination under test was a valid key, meaning it followed the "cribs" pattern and it didn't encrpyt any German letter to itself. The machine would continue moving through tests via gears if the combination under test was not a valid input. This process greatly reduced the number of inputs that needed to be tested. All the valid keys found by the Bombe were offloaded to human "computers" who used replica Enigma machines to find the correct key. This breakthrough allowed engima to be decoded within 24 hours and thus allowed the allies to decode all future Nazi transmissions without capturing the key sheets from German Nazi officers.

The key takeaway from this device was that a logic process could be automated by a machine. However this was simply the first of many such breaktrhoughs. Alan's Bombe was revolutionary but suffered from many limitations. It was a specialty built machine, meaning it only performed the specific logic check that it was hard wired to do. If Alan had wanted to check a different kind of logic check, he would have to physically modify the machine. Additionally the Bombe did not use binary and did not automatically store the output. Instead it's "output" was whether it halted or not, indicating a "True" / "False" result for the logic check. The rest of the decoding process was still manually done by humans.

After the war, following Charles Babbages' Dream: the general purpose computer

Compared to Charles' Analytical Engine, Alan's Bombe was simplistic. The Bombe did not generalize to all of math, and was hard wired. Charles had envisioned a generally programmable machine long before Alan's breakthrough. Charles' design was more advanced because it envisioned using punchcards as the program instructions and data. The Bombe could only process one kind of input, German messages encrpyted by the Enigma Machine. Charle's engine imagined a machine that could accept any binary input that could program all of mathematics. Alan's Bombe could only output one type of data, "halt" or "no halt". Charle's engine could output binary data that could theoretically represent all kinds of information. Alan's Bombe could not be re-programmed using different instructions. The logic that the Bombe used to check Enigma combinations was baked into the circuitry of the device. Charle's engine could perform all kinds of different logic checks. Charle's engine could be programmed with stacks of punchcards to do any sort of logic check in binary. While Charle's engine was never built, it was a theoretical leap ahead of Alan's Bombe developed more than a century earlier. Luckily, another genius named Von Neumann would take the lessons learned from both Alan and Charles and apply them to his breakthrough.

Von Nuemann was the genius who created the architecture for the Electronic Numerical Integrator and Computer (ENIAC). The ENIAC was the first programmable, electronic, general-purpose digital (binary) computer. First built in 1945, it was not yet capable of being programmable without human intervention. The ENIAC used IBM brand punchcards for input data and output data similar in theory to Charle's Analytical Engine design. The ENIAC originally used a plugboard to change programs. This manual method of re-programming was tedious and required manual human experts similar to Alan's Bombe. Later in April 1948 the ENIAC was upgraded to allow IBM punchcard inputs to reprogram program instructions. This upgrade marked the first time an electronically programmable general purpose digital computer was created. This was the dream Charles' had spent years designing, now finally brought to life through the dedicated work of teams of scientists. The ENIAC performed general purpose binary math instructions using triode vaccum tubes which stored binary data as the On/Off state of the vaccum tube. It was used for various military computations such as artillery trajectory table calculations.

These breakthroughs form the long chain that led to the development of the first electronic, general-purpose digital binary computer. In order to understand how these devices did math, an explanation of binary mathmematics must be given:

How Binary can recreate all of mathematics

In mathematics, Binary is represented as 0 and 1. Numbers in binary are represented in a base-2 system. A "base" system is a system of counting numbers that uses a number of base "states" to count things and perform computations. There have been many historically used base-systems for many different types of counting including base-60, base-10, base-2 and base-16. In base 2 each digit represents a power of two. For any base system this generally holds true. In base 10 each digit represents a power of 10. Each digit in base 2 has 2 possible options, 0 and 1. This pattern also generally holds for any base system. Each digit in base 10 follows the pattern with 10 possible options, the numbers 0 through 9. It is possible to convert from any base system to any other base system, such as from base-2 to base-10 or vice versa. Fundamentally a base-system is just a different way to represent the same numbers. Here is how to understand bases and convert between them:

The number 123 in base 10 consists of three digits. We could also write this number as \( 1 * 10^2 + 2 * 10^1 + 3 * 10^0 \). This is how base 10 works, each digit is equal to the value of the digit itself multiplied by 10 to the power of the digit's position from the decimal point (or from the "tenths" place). If the digit is immediately to the left of the decimal point in the "tenth's place" the power of ten used is \( 10^0 \) which equals 1. This is called the "ones place". Moving a digit to the left we have a 2 in the "tens place", the power of ten used is \( 10^1 \) which equals 10. Moving another digit to the left we have a 1 in the "hundreds place", thus the power of ten used is \( 10^2 \). This representation can be extended for any number.

The method of representing a number as a collection of digits and base powers can be extended to other base systems like base 2. The base-10 number 123 can also be written as \( 64 + 32 + 16 + 8 + 2 + 1 \). Each of these numbers in the addition sequence is a power of two. Replicating our pattern earlier with base-2 instead of base-10 we have:

\( 123 = 1 * 2^0 + 1 * 2^1 + 0 * 2^2 + 1 * 2^3 + 1 * 2^4 + 1 * 2^5 + 1 * 2^6 \)

If we simplify the powers of two we see a sequence similar to our addition sequence earlier. This sequence also contains powers of two that are not in the addition sequence used earlier to add powers of two up to \( 123 )\:

\( 123 = 1 * 1 + 1 * 2 + 0 * 4 + 1 * 8 + 1 * 16 + 1 * 32 + 1 * 64 \)

We can also observe some other patterns. The left hand side of each multiplication pair is always 1 or 0. This corresponds to our two possible digit values in base 2. Additionally, the right hand side of each multiplication pair is always a power of two which increases by a factor of 2 for every digit in the sequence. We can re-construct any base-10 number this way in base-2. In binary notation we write a 1 when the value of that digit is 1 and 0 when it is zero. Increasing powers of two are written right to left just like how increasing powers of ten are written right to left for base 10 numbers. That means this binary number is written as:

\( 123 = 1111011 \)

The final binary number follows the same sequence as the previous multiplication sequence. The binary number is written as the sequence of the left hand part of those multiplications in the previous equation. This method is used to go between base-10 and base-2 for any integer number.

We can also reverse this process to go from base 2 to base 10. Using the same binary number \( 1101111 \) we will write this number as a sequence of digit value multiplied by the digits' "place" in the sequence. Binary numbers written this way are read right to left just like regular base 10 numbers. The rightmost 1 is in the "ones" place, the second rightmost 1 is in the "twos" place. the thirdmost 1 is in the "fours" place, and so on. \( 1111011 \) is thus written as:

\( 1101111 = 1 * 1 + 1 * 2 + 0 * 4 + 1 * 8 + 1 * 16 + 1 * 32 + 1 * 64 \)

We can then see that this is the same sequence from earlier, which we know equals \( 123 \) in base 10.

Binary can do more than just represent numbers differently, they can also be added, subtracted, multiplied and divided just like base 10 numbers.

Binary operations follow the same mathematics rules as base-10 operations. Two numbers are operated on in left to right notation. The same rules for associativity, communtivity, and PEMDAS ordering apply to base-2 and base-10. Both numbers are written such that their digit values increase in value by a factor of the base number going from left to right. These places extend to any power of the base, which allows each representation to represent infinite value. Binary numbers can be written as any number of digits but some standards have been agreed upon. a Byte is eight bits long and it is the smallest standard representation of any binary number. Two Bytes (16 bits) is referred to as a "half-word". Four Bytes (32 bits) is referred to as a "word". Eight Bytes (64 bits) is referred to as a "double word" or a "double" for short. The pattern between these standard lengths is that the size keeps doubling. Other sizes can be used but these ones are the standard sizes of data used today in most computers, circuits and data theory.

Binary Addition

Binary addition follows the same rules as base 10 addition but it has simpler outcomes because there are only two states for each digit. When you add two binary numbers you align their digits just like you would for base 10 numbers. To add two base 2 numbers like \( 00000110 )\ and \( 00000101 \) you first line up their digits vertically. Doing this we have \( 00000110 + 00001101 \). notice that the first digits in the "ones place" to be added are 0 and 1. The second digits in the "twos place" to be added are 1 and 0. The third digits to be added in the "fours place" are 1 and 1. The fourth digits to be added in the "eights place" are 0 and 0 and so on. This example was chosen because it represents the four possible combinations of two binary digits. These combinations can be represented in a truth table, and each combo has a specific solution. Going through each combo:

\[ \begin{align} 0 + 0 &= 0 \\ 1 + 0 &= 1 \\ 0 + 1 &= 1 \\ 1 + 1 &= 10 \end{align} \]

The last digit addition means that \( 1 + 1 \) will "carry over" into the next left digit. This logically makes sense because 1 in base ten and base two is 1 in the "ones place" for both representations. \( 1 + 1 = 2 \) in base ten which puts 2 in the "one's place". In binary, two is represented by a 1 in the "two's place" thus we write it as \( 10 \). The carryover digit is similar to adding 6 and 5 in base ten. 6 + 5 carries a "1" into the "ten's" place for a result of 11. If we follow our addition example digit by digit then 00000110 + 00000101 = 00001011. We can check this answer by converting the equation to base ten. \( 00000110 \) equals 6. We can see that it equals \( 4 + 2 \) because there are 1's in the two's place and the fours place, and zeros everywhere else. \( 00000101 \) equals 5. We can see it equals \( 4 + 1 \) because there are 1's in the one's place and the four's place. \( 6 + 5 = 11 \) is the expected result of our addition. We can see that \( 00001011 \) is indeed equal to eleven because we have 1's in the eight's place, the two's place, and the one's place, and zeros everywhere else. The value represented is thus: \( 8 + 2 + 1 = 11 \). Its important to note these additive patterns hold for all digits in binary numbers. This means you can add any binary number to any other binary number of any length by following just these four rules. This is equivalent to three rules if you consider that order of addition doesn't matter because of associativity, because \( 0 + 1 = 1 + 0 \). This simplification of addition is what makes computers powerful. If you wanted to do the same thing with base ten you'd have to have one rule for all 55 unique combination of base ten digits.

Binary multiplication

Binary multiplication has the same properties as base ten multiplication. Going back to our example with 6 and 5, or 00000110 and 000000101 in binary, we stack these two vertically to perform the computation. \( 5 * 6 = 30 \) in base ten. \( 00000110 * 00000101 = 00011110 \) in base two. multiplication in binary also has four possible combinations for each digit like addition. similarly, one combination has a result that "carries over" into the next leftmost digit. Unlike addition, the place of the digit we're multiplying, whether its in the ones place, twos place, or any other place, determines the result. To calculate the result we double the value of the result multiple times depending on the "place" of both operands in the multiplication. Luckily, binary numbers can easily be doubled by shifting them left. They can also be divided in half by shifting them right. This is a pattern in base ten as well. For example: \( 200 * 200 = 2 * 2 * 100 * 100 = 4 * 10,000 = 40,000 \). Notice how when we did that multiplication we first multiplied the digits and then shifted the result leftwards four places (equivalent to multiplying by 10,000). The answer is 40,000, because 2 * 2 is 4, and both twos were digits in the hundred's place. Digits in the hundreds place are shifted left of the decimal point by two places. When we multiply two "hundred's place digits" together we add their shifts together. Since both digits were shifted left two places, the total shift of the multiplication must be \( 2 + 2 = 4 \) places.

Unlike addition we also have to take into account digits that aren't aligned vertically. In base ten, the calculation \( 2 * 20 = 40 \) is an example of this. going digit by digit we have 2 and 0 in the one's place. \( 2 * 0 = 0 \) and clearly the result of \( 2 * 20 \) is not just 0. We also have to multiply each place with every other place. The one's place also multiplies the ten's place, and the hundred's place, and the thousand's place, and so on. As an example lets Multiply two base ten numbers with three digits each: \( 123 * 456 \). We multiply the three digits of 123 with each digit of 456 one by one. The calculation is sub-divided into 9 different single digit multiplications.

\[ \begin{align} 123 * 456 &= \\ 100 * 400 &+ \\ 100 * 50 &+ \\ 100 * 6 &+ \\ 20 * 400 &+ \\ 20 * 50 &+ \\ 20 * 6 &+ \\ 3 * 400 &+ \\ 3 * 50 &+ \\ 3 * 6 &= \\ 40,000 + 5,000 + 600 &+ \\ 8,000 + 1000 + 120 &+ \\ 1200 + 150 + 18 &= \\ 56,088 \end{align} \]

We follow this procedure with base two as well. Each digit in 00000101 is multiplied with every digit of 00000110. Thus we will have 64 individual digit-digit multiplications, 8 digits multiplied with 8 other digits. Luckily the possible values of each combination in binary is only 4 possible values instead of 100 possible values. For simplicity the example numbers 5 and 6 have been reduced to half-bytes to reduce the number of "times zero" multiplications in the calculation. The below table shows how a computer breaks down the multiplication digit by digit:

\[ \begin{align} 0110(5) &\\ * 0101(6) &=\\ (0 * 1) 0000 * 0001 &+ \\ (0 * 0) 0000 * 0000 &+ \\ (0 * 4) 0000 * 0100 &+ \\ (0 * 0) 0000 * 0000 &+ \\ (2 * 1) 0010 * 0001 &+ \\ (2 * 0) 0010 * 0000 &+ \\ (2 * 4) 0010 * 0100 &+ \\ (2 * 0) 0010 * 0000 &+ \\ (4 * 1) 0100 * 0001 &+ \\ (4 * 0) 0100 * 0000 &+ \\ (4 * 4) 0100 * 0100 &+ \\ (4 * 0) 0100 * 0000 &+ \\ (0 * 1) 0000 * 0001 &+ \\ (0 * 0) 0000 * 0000 &+ \\ (0 * 4) 0000 * 0100 &+ \\ (0 * 0) 0000 * 0000 &= \\ \\ \text{first bit of 5 times all bits of 6: } 0 + 0 + 0 + 0 = 0000 + 0000 + 0000 + 0000 &+ \\ \text{second bit of 5 times all bits of 6: } 2 + 0 + 8 + 0 = 0010 + 0000 + 1000 + 0000 &+ \\ \text{third bit of 5 times all bits of 6: } 4 + 0 + 16 + 0 = 0100 + 0000 + 10000 + 0000 &+ \\ \text{fourth bit of 5 times all bits of 6: } 0 + 0 + 0 + 0 = 0000 + 0000 + 0000 + 0000 &= \\ \\ 16 + 8 + 4 + 2 + \text{lots of zeros} = 30 = 00011110 \end{align} \]

This method is known as a binary "truth table". In base ten, "multiplication tables" are taught to elementary students and are memorized to multiply any digit 0-9 with any other digit. Multiplication tables are equivalent to a base ten "truth-table". In base-2 the truth table is as follows: \(0 * 0 = 0, 0 * 1 = 0, 1 * 0 = 0, 1 * 1 = 10\). The result is shifted left based on the place of each digit. For example: \(0101 * 0001 = 0101 \text{which is equivalent to: } 5 * 1 = 5 \), \(0101 * 0010 = 1010 \text{which is equivalent to: } 5 * 2 = 10 \), \(0101 * 0100 = 00010100 \text{which is equivalent to: } 5 * 4 = 20 \). There's an important pattern to note here. If we multiply a binary number by another binary number where the second binary number only has one "one" digit and all the other digits are zero, we effectively shift the first binary number to the left by that digit's "place" in the second number. This makes logical sense because we are multiplying some number by a power of 2. When we multiply a number by a power of two we call it a logical shift left. The amount of shifting is equal to what power of two we are multiplying by. We did this earlier with binary number 5. When we multiplied binary 5 with binary 1 we multiplied it by 2^0, and we also shifted it zero places to the left. This is because 5 * 1 is equal to 5. When we multiplied binary 5 with binary 2 we multiplied it by 2^1, and we shifted it 1 place to the left to get ten. When we multiplied binary 5 with binary 4 we multiplied it by 2^2 and we shifted it two places to the left to get twenty. This pattern continues, multiplying by 8 is a shift to the left of three places because \(8 = 2^3\). Multiplying by one-half is a shift to the right by one place because \(2^-1 = 1/2\). This works for any power of two. We use a strategy of splitting the second number in the multiplication (the multiplicand) into its powers of two when calculating by hand and by computer. We then simply shift the first multiplicand by these powers of two, and then add the result for each separate multiplication. This means that multiplying a four digit binary number with another four digit binary number results in four "power of two multiplications" we call logical shifts, and then four additions of each result to get the final result. This pattern is called Logical Shifting. We can Logically shift left to multiply by positive powers of two. We can Logically shift right to multiply by negative powers of two. Combining these logical shifts and addition is what implements binary multiplication in hardware.

As an example of multiplication with multiple digits we have: 0101 * 0110 (5 * 6). We break this down into two multiplications, just like we do for base-10, giving us: 0101 * 0110 (5 * 6) = 0101 * 0100 (5 * 4 or 5 * 2^2 or binary 5 shifted left 2 times) + 0101 * 0010 (5 * 2 or 5 * 2^1 or binary 5 shifted left 1 time). The result of the first shift is: 0101 * 0100 = 00010100. The result of the second shift is 0101 * 0010 = 1010. Adding these two sub-results gives us the final result: 00010100 + 1010 = 00011110. Checking this result to make sure it equals base-10 thirty we have: 00011110 = 0 * 1 + 1 * 2 + 1 * 4 + 1 * 8 + 1 * 16 = 2 + 4 + 8 + 16 = 30.

This binary system also has it's own modification to regular addition, subtraction, multiplication, and division, as well as every other mathematical operation. No matter what base is used, all mathematical operators work the same way. This is important in number theory because it allows humans to compute things using only two states instead of needing ten states to represent the standard digits of base 10.

Extension of binary to signed integers

So far binary representation has been expressed as positive whole numbers. These are called the "natural numbers" in mathematics. Negative numbers also exist, so we must represent these negative numbers in binary. Scientists debated on what standard to use to represent negative numbers for a long time but they eventually settled on the Two's complement method.

Two's complement is a standard for representing and creating negative numbers. Two's complement works by using the leftmost bit in a binary number as the sign bit. This means that this left-most bit does not encode value, instead it encodes the sign of the number, whether it is positive or negative. We call binary numbers encoded this way "signed". Binary numbers that don't use a sign bit and use all bits for value are called "unsigned". For example, an 8-bit signed integer represented in binary like "00001000" represents positive 8. If we changed this signed integer to have a "1" bit in the leftmost bit position like so "10001000", this would encode negative 8. This is different from an unsigned integer where "00001000" would be 8 and "10001000" would be 128 + 8 = 132.

Its also possible to convert a positive binary number into a negative binary number with the same absolute value. This is done with signed binary integers in two steps. The first step is to take the binary number we want to invert, for example: 00001000 which equals 8 and flip every bit. This means every bit will swap zero with one or one with zero. In our example of 00001000 inverting this number gives us 11110111. The second step is to add positive 1 or 00000001 to the inverted number 11110111. Adding these two numbers gives us 11111000 which is the binary representation of "-8". We can check this by adding the binary "8" to "-8" and seeing if the result is 0. If we add 00001000 and 11111000 we get 00000000 thus verifying that 11111000 is the correct representation of "-8".

Binary Subtraction

By extending binary to use signed integers we also introduce subtraction. Subtraction in mathematics is really just addition with negative numbers. This means that implementing subtraction is a simple extension of addition and 2's complement inversion. Borrowing from our earlier examples: say we want a computer to do "8 - 5 = 3" in binary. First we recognize that eight minus five "8 - 5" is equivalent to eight plus negative five "8 + (-5)". Then all we need to do is convert positive five into negative five in binary, and then add positive 8 with negative 5 to finish the calculation.

First we take positive five in binary "00000101"
Next we flip all the bits of positive five to start the 2's complement process: "00000101" becomes "11111010"
Next we add one "00000001" to the previous result "11111010" to get negative five "11111011"
Optionally we show that adding negative five and positive five gives zero: "00000101 + 11111011 = 00000000
Next we add postive eight "00001000" and negative five "11111011" to get positive three "00000011"
Thus we have calculated "8 - 5 = 3" as "8 + (-5) = 3" or equivalently: "00001000 + "11111011" = "00000011"

In hardware this operation is implemented in similar to the mathematical procedure described above:

First we store positive five in binary "00000101" as wired inputs of high and low voltage. Typically each bit would be input from the pins of a bus wire which is any conductive material connected to the rest of the circuit. "0" indicates low voltage, "1" indicates high voltage
Next we flip all the bits of positive five. We call this state "1's complement". This is implemented as a simple NOT gate that inverts the voltage for each input. NOT logic gates are contructed using transistors.
Next we add one to the previous result. This is implemented using a regular Adder circuit. As discussed earlier Adders are composed of five pins: input A, input B for the operands, An incoming "carry" bit often labelled Cin, an outgoing "carry" bit often labelled Cout, and the result bit. To "add one" to the "one's complement" of five we set the following: Set the "carry-in Cin" bit of the rightmost bit to "1". set input-A to "11111010". Set input-B to "00000000". We then tell the circuit to add. This is equivalent to adding one to any input. The output of the circuit will be negative five.
Next we store the previous output of negative five somewhere so we can use it as an input into the adder circuit
Next we set the following for the Adder circuit: Input-A is set to (-5). Input-B is set to 8. Carry-in is set to 0. We tell the circuit to add these two numbers and the result will be positive 3.

Binary is why we made Transistors

Historically, computers used a number of different bases for their computations. All modern computers have settled on base-2 as the ideal way to do computation. Base-2 was chosen because of the simplicity of representation and also because its easy to implement binary with electrical circuits. Electrical circuits also make good computers for many other reasons. To see why we will examine what makes a computer "good". In order to make a good computer, you would in theory want a device that has the following properties:

A storage method that allows you to store lots of information in the smallest possible space
A computation method that doesn't require lots of energy
A transmission method thats as fast as possible
A computation method that is safe and reliable, not prone to failure, and as accurate as possible
A storage method that can swiftly switch between states for computations

Many devices have attempted to satisfy these requirements, and the best implementation we currently have is the modern electrical binary computer. Modern computers use electrical circuits because they satisfy these requirements the best compared to any alternatives:

Electrical circuits were miniaturized, which created a storage medium that stored lots of information in very small spaces.
Electrical circuit computers are much more efficient in energy per bit of information compared to earlier mechanical gear computers, where for example, each gear tooth might've been one "bit".
Electricity and light are the fastest things in the universe, so transmission speed is unmatched compared to any other method.
Electrical circuits (when designed correctly) are extremely reliable, predictable (thus accurate with results) and not prone to failure.
Electrical circuits can be designed with switching speeds in the sub-nano second range. This means computations happen extremely fast compared with previous methods.
Electrical circuits can implement binary as "On" and "Off". making these types of computer simpler to implement

We represent bits in circuits using High and Low voltage. First the engineer decides on a voltage range for the circuit (or sub-circuit). High voltage is the top of that range, usually something like 3.3V, 5V, or 12V, depending on the circuit. Low voltage is at the bottom of that range, at 0V. Circuits are designed to store bits as high or low voltages. These voltages are electric charges that are held in transistors. Transistors are a type of circuit device that can rapidly switch between a conductor and an insulator depending on an external input.

In layman's terms a transistor is a glorified lightswitch, but extremely small and fast. The reason Silicon Valley gets it's name is because Silicon is a semi-conductor element. A semi-conductor is defined as a substance which can become a conductor or insulator based on some external factor. As discussed earlier, transistors are circuit devices designed to do exactly this, switch between conducting and insulating. Think about why a lightswitch is useful to us. A lightswitch is only useful if I can control when electricity flows (conducting), and when it doesn't (insulating). Otherwise the lights would always be on, or off, and I wouldn't be able to control them. Similarly, I want to control the bit state held inside a transistor. I can't do that if the transistor is always at High voltage (conducting) or Low voltage (insulating). The transistor is only useful if I can cause it to switch between on and off states.

A Metal copper wire doesn't make a good lightswitch, because electricity always flows.

A rubber wire doesn't make a good lightswitch, because electricity never flows.

A semi-metal silicon transistor makes an excellent lightswitch, because I can control whether it behaves like a conductor or an insulator. This property is why scientists use silicon and other semi-metals for transistors. Modern transistors come in many forms and have a long history of improvement. One of the earliest types were vacuum tube transistors. These were large bulky devices that were prevalent in the 1950s. As transistor design evolved, new types of transistors were created. The most prominent was the Metal Oxide semi-conductor Field Effect Transistor (MOSFET) first demonstrated in 1959. This type of transistor is the type used in all modern computers today. Lets first break down what MOSFET stands for:

Metal represents the Silicon metal used
Oxide represents the oxygen + other molecule (ususally silicon oxide) material used
Semi-conductor refers directly to the purpose of the transistor, being a semi-conductor, by using semi-conducting materials
Field refers to the magnetic field. This magnetic field is created by an external circuit. This is what tells the device to switch
Effect refers to the "Field Effect" which is the effect of the magnetic field on the transistor
Transistor

Putting this all together, a MOSFET is a transistor made of semi-metal and Oxygen molecules that act as a semiconductor, and that semi-conductor is controlled by the field effect of an external magnetic field. This means that MOSFETs change their conducting behavior depending on this external magnetic field. Engineers create circuits that switch transistors using small electrical currents. This works because electrical currents generate magnetic fields. The external current is the mechanism behind the Field Effect of the transistor.

The arrangement of Transistors is what defines Boolean logic gates

The theory behind modern computing was thought up by a genius named George Boole. Boole introduced a new branch of math called Boolean algebra, which is the algebra of logic. Boolean algebra has two states for any logic "question": True or False. These "questions" can be evaluated by themselves, or with other "questions" as part of a complex expression. Ultimately by chaining logic expressions together we can recreate the rest of mathematics and perform calculations. This is useful to humans because we can use two absolute states to do math. As discussed earlier, we store information in binary states. We do this with transistors that store these bits as High and Low voltages. To implement Boolean logic, we first must represent the "True" state as a high voltage, and the "False" state as low voltage. Then we must implement the Boolean operations. The arrangement of transistors is what implements Boolean operations in the physical world. Boolean logic expressions are created by chaining these operators together.

Boolean algebra has the following fundamental operators:

NOT - outputs False when the input is True. It inverts the input
AND - outputs True only if both inputs are True, otherwise it outputs False
OR - outputs True if either input is True, only outputs False if both inputs are False
XOR - outputs True if the inputs are different (either True and False or False and True). If the inputs are the same (True and True or False and False) the output is False
NAND - outputs the inverse of AND. This means it outputs False only if both inputs are True, otherwise it outputs True
NOR - outputs the inverse of OR. This means it outputs False if either input is True. It only outputs True if both inputs are False
XNOR - outputs the inverse of XOR. This means it outputs False if the inputs are different. If the inputs are the same the output is True

These operators are called gates when implemented as a circuit. These gates are wired together to create Boolean logic expressions. An example expression is 3-AND, which is a logic gate that outputs High Voltage only when all three inputs are High Voltage, otherwise the output is Low Voltage. This is equivalent to the Boolean Expression "A & B & C" where "A, B, C" are the inputs and "&" represents the AND boolean operator. The logic result of this expression is True only if all the inputs are True.

The arrangement of Boolean logic gates is what defines a Computer Architecture

Computers are complex arrangements of Logic Gates that allow us to compute regular math very fast. These calculations drive everything in a computer. All computer graphics, all human to computer interactons, the Internet itself, all of it is fundamentally just mathematics. We design a complex circuit to do that math fast and accurately. We use the circuit to store this useful information for later use. We call that circuit a computer.

The structure of a computer starts in the Central Processing Unit or CPU. The CPU contains the physical transistor circuit that does calculations. The only other part of the computer that processes information is the Graphics Processing Unit or GPU. some computers do not have GPUs and only have a CPU. All the other parts of the computer simply exist to support the CPU, store it's result, or power the CPU. The physical CPU design starts with designing the circuitry that allows it to Compute math. This sub-section of the CPU is called the Arithmetic Logic Unit or ALU. The ALU receives inputs from the CPU and does the actual calculations in hardware.

The ALU and how binary operations are done in hardware

We will first focus on the implementation of the ALU. To start we need to implement the four basic mathematical operations on binary inputs in the ALU. We will start with Addition because binary addition is the beginning of binary operations. We need to implement addition first because we use the addition circuit to implement multiplication, subtraction and division.

INSERT HARDWARE DISCUSSION OF FULL ADDER IMPLEMENTATION THEN DO TWO'S COMPLEMENT IMPLEMENTATION WITH NOT GATES THEN DO SUBTRACTION IMPLEMENTATION AS ADDITION CIRCUIT + TWO'S COMPLEMENT CIRCUIT

The first computer architectures contained:

A circuit that computed and stored information in the same place, most often in a physical medium like gears or dials
A power source, often mechanical or human powered

Some examples include the Turing machine, the Antikythera mechanism, and the Charles Babbage Difference Engine. All of these were Analog computers, meaning they did not use binary states of Boolean True or False to do their computations. Analog means that the storage of information was based on a continuous flow of state. This is different from digital, which stores state in two distinct non-continuous states. They were also designed with a specific calculation in mind, and could only run one program. Often they were mechanically driven instead of the modern way of being electrically driven.

The Evolution began when a genius named Vonn Nuemann came along and proposed that the memory storage be made separate from the computation mechanism. The "Von Neumann model" is an upgraded computer architecture that uses a separate memory to store program instructions and data. This was an evolutionary leap because the storage of results and instructions was no longer tied to the processor that computed the results. Before this, computers used physical objects to process and store the results. Using gears as an example, the Charles Babbage difference engine processed the program in the gears of the device. It also held the results of the computation in the gears of the device. This meant that results had to be written on paper separately by an operator. It meant that if you wanted to change the program's state, you'd erase the results and the current processing. If you wanted to compute a different program, you'd have to manually adjust the gears yourself. It is this crucial separation of computation and memory that served as the fundamental turning point towards modern computers. By storing data separately and automatically, the computer could also read data as instructions and set its own program state without human intervention. This most fundamental modern computer architecture contains:

An electrical Central Processing unit (CPU)
A Memory storage medium that used electric signals to read/write
A stored electrical power supply
A series of wires to connect everything

Some examples include Punchcard computer systems.

As discussed earlier, Modern computers chose to use a binary system of electrical circuits including transistors to implement boolean logic. This logic is implemented to create a computation engine which we call the "Central Processing Unit" or "CPU". The CPU contains sub-circuits that do whatever instructions we have designed it to. The CPU contains many parts:

Registers, which are small memory circuits within the CPU itself that the CPU can use to temporarily hold binary instructions or data for immediate processing
A Memory Cache, which in modern computers usually has multiple levels (usually 3). This stores recently used information for faster recall. This is implemented because memory lookups are the slowest operations a CPU performs. Minimizing the need for memory lookup is essential to modern computer design.
Sub-circuits for each part of the instruction set. The addition operation is implemented as a physical sub-circuit that adds two numbers using two input registers as its inputs. The result is transmitted to the memory store subcircuit.
A program counter register which tracks where in memory the CPU is running programs from.
Multiple sub-CPU "cores" that each function as a mini-CPU. Since a single "core" runs only one program at a time, multiple "cores" are used to run multiple programs simultaneously. Even with one core, computers can appear to run multiple programs at a time because the CPU constantly switches tasks imperceptibly fast.
A clock circuit to dictate switching speeds for the CPU. This clock circuit uses electrical properties to switch states extremely fast and at a precise time. This clock circuit dictates how fast the CPU switches it's own state. That means it controls how fast the computer does calculations. If set too fast, the circuit in the CPU will physically break. Setting the clock speed of your CPU higher than the manufacturer setting is known as "Overclocking".

There are lots of types of instructions, and lots of instruction sets. Here is an example of one:

An Add instruction. Takes two binary numbers, adds them, and stores the result
A subtraction instruction. Takes two binary numbers, subtracts them, and stores the result
A multiplication instruction. Takes two binary numbers, multiplies them, and stores the result
A division instruction. Takes two binary numbers, divides them, and stores the result
A Load instruction. Takes information from memory and puts it in the CPU
A Store instruction. Takes information from the CPU and puts it in memory
A Move instruction. Takes information from one part of the CPU to another

A basic CPU consists of five steps to conduct these instructions:

Instruction Fetch: reads the input voltage of the selected parts of memory. this includes CPU registers, Memory cache, or Memory itself if neccesary. the instruction is a sequence of input bit voltages that are designed in the circuit to represent an instruction. These instructions start with an instruction code such as "00110011". This sequence of 8 bits is actually a sequence of High and Low voltage outputs that are fed as inputs to the CPU circuit. The Bus wire connects the circuits together. This sequence could hypothetically represent an "addition" instruction. The rest of the instruction would then follow. Hypothetically this could be two 8 bit numbers like "00000000" (0) and "00000010" (2). another 8 bits "00000100" (register number 4) could be used as a register address to store the result.
Instruction Decode: Puts the input instruction voltage through a decoder circuit that selects which computation instruction to do. We may also need to fetch input signals to complete the operation, if the operation needs something from memory
Instruction Execute: Puts the decoded instruction input through the correct instruction circuit to complete the instruction
Memory store: Wires the output of the memory store sub-circuit to the Memory circuit itself. This stores the bit value of that instruction
Result store: Puts the Output of the CPU instruction into the CPU registers

A basic Memory takes the voltage outputs of the CPU across the bus-wire connection and stores it in a storage medium. This can be done in a variety of different ways. It uses a memory address provided by the CPU's instruction to specify where to store the information. This memory also needs to be readable in some way. Some very old memory mediums are physically observable, like punchcards, dials, or vacuum tubes. Modern memories store information in transistor states, so that the amount of memory can be very large, and so the memory stays as it was, even if the computer is not powered on.

The Computer Architecture defines the Computer

Modern Computers are much improved from punchcard machines. Punchcard machines are based on Assembly Language. Assembly Language is a circuit dependent language that is a direct 1-1 translation from the Computer's instruction set, to the Assembly language instruction set. This is the bridge between software and hardware. But humans were never meant to talk in 0 and 1, so they developed compilers. Compilers are complex programs that turn a human readable programming language into assembly language for a wide variety of different computer types. One of the first examples was the programming language FORTRAN. Using these more readable and standardized programming languages, humans were able to program computers more efficiently because every program could be written in FORTRAN, yet run on machines with different instruction sets. At the same time, a hardware Evolution was underway which would allow computers to be smaller, faster, more efficient, and talkative.

Modern computers usually include:

A Cooling system using fans or water pipes
A set of (uaually multiple) Random Access Memory (RAM) chips that handle ongoing programs and their memory operations.
A bridge between the CPU and Memory (including RAM) that contains a memory cache. This cache holds often used data so that it can be recalled quicker in the CPU circuit.
A dedicated long term Memory storage medium typically using optical disk drives (hard drives) or Solid state drives (using electrical circuits that are state-stable even with no power)
A Universal Extensible Firmware Interface (UEFI) chip, used to boot the machine and load the Operating System. This evolution allows the computer to run multiple programs with the push of one power button. Before this, computers could only run one program. The UEFI contains a simple circuit that receives an "ON" signal from your button press, and loads instructions to boot the operating system. This program usually also does basic checks on the hardware devices in your computer to see if they're working.
A peripheral circuit connection for a computer mouse, usually by USB
A peripheral circuit connection for a TV screen, usually by HDMI
A peripheral circuit connection for a WiFi micro-wave radio antenna (also known as a Network Chip). This allows your computer to send and receive micro-wave radiation signals in specific frequencies. These are usually 2.4 Giga Hertz (2.4 billion cycles per second) or 5 Giga Hertz.
A perihperal circuit connection for an Ethernet cable. This cable sends and receives internet data using copper wires.
A peripheral circuit connection for a sound system, usually by Universal Serial Bus (USB)
A motherboard that connects everything and has additional circuit connections for additional chips

A list of the relevant inventions that go into a modern computer include:

The invention and miniaturization of the transistor, which replaced Vacuum tube transistors
The invention of Light Emitting Diodes (LEDs) which replaced Cathode Ray Tube Television Displays and regular resistor lightbulbs
The invention of Solid State Drives which replaced Optical Disk Hard Drives, which had replaced magnetic tape readers, which had replaced paper punch cards
The invention of lasers. Lasers enabled Fiber optic network communication, Optical Disk reading, computer mouse position reading, and much more!
The invention of micro-waves. Micro-waves aren't just for cooking! Micro-waves are also used at low power to transmit Binary data very fast. This is because micro-waves have a high frequency, which is the number of cycles per minute the micro-wave wave signal can switch between high and low signal. This means that antennae that transmit/receive these frequencies can do so at high speed.
The standardization of USB, HDMI, Ethernet, Wi-Fi which are all standards of binary communication that multiple groups have standardized between lots of different devices. This is what enables many different computers and related devices to talk to each other in a common format.

Computers form Networks

Artificial Intelligence

While AI seems like a complicated topic, understanding the mathematics behind AI reveals it is simpler than some people think. Fundamentally, AI in its modern form is a pattern recognition machine. All AI architectures use probability and statistics to transform digital inputs into usable patterns. From a high-level view there are a couple main AI "types". The main AI types in use today are:

Generative AI: generates new content based on learned patterns from training datasets
Discriminative AI: discriminates, recognizes, and labels patterns in data. This type is used in image classifiers, computer vision, and data labelling tasks.
Reinforcement learning AI: develops patterns of action to accomplish tasks based on programmed "rewards". This AI is similar to the video-game idea of Aritificial intelligence, it takes an action within the scope of a game and measures output and reward based on programming. This allows it to play games or do tasks that require strategic thinking.
Self directed learning AI: develops patterns without programmer input by relying on accomplishing a simple task. The "simple task" is easy to verify, but it causes the AI to learn complex patterns by using the simple task as a reward signal. For example, the natural structure of the english language means that certain words are correlated. Certain letters and arrangements of words correlate with grammatical structure. An AI can learn these statistical relationships based on a "fill in the blank word" task. This bypasses the need for a human programmer to program every rule of grammar into the model. Instead the model analyses enough "good text" to imitate good grammar by imitating the style of good text because it gets good at predicting the missing word. It does not have an absolute programmed state that informs the grammar of it's output. This is precisely why AI is "gas-light-able" because if the model is trained on bad text it will associate the wrong "fill in the blank" pattern with good output and serve it to you. Many AIs use this method to appear to learn the rules of languages when in reality they are simply following the statistical pattern of the training dataset and the task.
Unsupervised learning: develops a pattern without any human intervention OR reward signal. Instead the AI analyses fundamental statistical relationships in the data, like mean, variance, and value clustering to observe and report details on that data like how many variables the data likely has. In a graph of a statistical relationship between data, it is possible to quantify "roughly how many variables are needed to represent this data". The model may also be asked to represent the data with less variables. This informs the model to optimize the data.

AI of all types are made from a similar process. In the following example assume we are training a Convolutional Neural Network (CNN) to take photos of cars and classify them:

Gathering the car photo data for training. This will include photos of things that aren't a car so we can train the model on each of our classifications and "other" data.
Normalizing the car photos. This means resizing the images to a standard, ensuring they're encoded the same, and validating the content of them to make sure they don't include erroneous training data such as a picture of an animated cartoon car.
Feed normalized input into the first layer of the CNN.
Convolute small 3x3 filter matrices with the input. These matrices initally contain random value elements. These elements are adjusted statsitically into the filters used after backpropagation.
Use an activation function to capture the filter response. Typically a "higher than threshold -> pass response" type filter is used. This generates the feature map image result for each filter matrix. Its very similar to a greyscale ish image. each pixel of the feature map (for example the first 224x224 pixels) is one "Neuron" that can either be activated or not activated. If the neuron is not activated then the pixel is set to 0 and is black. If the neuron is activated then the pixel is set to a shade of white that correlates with how activated it is (brighter = more active). Sometimes the activation value has to be normalized to a 0-255 brightness value common for regular images. This is because the activation value itself may have a widely different range.
OPTONAL: Pool or downsample by taking the activation function of parts of the whole input rather than the whole input. In practice this means a 224x224 image is downscaled to 112x112 as a 2x2 feature generating matrix slides over the big image and averages 4 pixel values into one pixel value. This is sometimes done to "capture" bigger features while improving efficiency
Batch Normalize the layer output similar to normalizing the car photos. In practice this means setting the mean to 0 and the variance AKA standard deviation to 1. The model makes both of these normalization factors parameters so in practice they aren't always zero and one instead they're learnable parameters. The mean adjustment is done mathematically through a shift (if mean was 67 then the shift would be -67). The variance adjustment is done mathematically through a scale factor (if variance was 2.6 then scale factor would be 1/2.6).
Repeat steps 4 through 7 for as many layers as needed. Layers progressively use double the number of filter matrices to capture bigger features. Powers of two are used primarily due to hardware efficiency and memory alignment. When this is done on say a 224x224 pixel image as a typical example, 64 is the depth selected for the first layer. The filter matrices themselves are typically held at a constant size such as 3x3. These matrices slide across the image (with padding on the edges) and they convolute with it. This generates 64 separate feature maps that represent the basic patterns in the image like horizontal and vertical lines. The next layer applies 128 matrix filters to 64 feature maps. But it doesn't store all 128x64 combinations as a separate feature map. Instead it sums all 128 combinations on one feature map much like a summation of sines and cosines represents a superposition. This is also quite similar to summating a series of normal maps and color maps to generate an overall material in Blender/Unity. The output then is a weighted sum of all 128 would-be feature maps condensed into one sum that serves as the result feature map (10% of feature map 1/64, 23% of feature map 2/64 etc.). Thus the number of result feature maps is equal to the number of filter matrices used (the depth) however the number of total data represented quickly spirals exponentially (64x128x256x512) etc. Even the number of actual "work" performed still doubles as we double the number of filter matrices from 64,128,256,512,1024 etc. Thus to combat this we pool or downscale our image in later layers so that later layers apply their more complex filters on feature maps that are averaged. For example you might take the feature map output from the second layer and downscale it from 224x224 to 112x112 such that the next layer does 256 feature map convolution calculations on an image that is 1/4 the original size.
Output the inference guess in the final layer. The final output (for example using 512 as the final layer's feature map count) the program will average all pixels for every 512 "features" to create a final "feature vector" that represents the "strength" of that feature. It is this final list that starts categorizing images. Usually by this layer, the original image has been pooled to a tiny size from 224x224 to even 7x7 so the averaging for each of the final "strength vectors" occurs over 7x7 = 49 pixels. This "strength" is then labelled for what "feature" it represents. The inference is generated by analyzing the weights of the final feature vectors. The example given for cars might include a feature labelled "has wheels" and "is metal". If both these features rank high in the final feature vectors the inference is more likely to become "car". This final inference result is what is passed on to the user. These final "logits" are not selected by humans but are learned through the model so we won't typically know what they are unless we do intensive back-calculation. This is why people say AI is a "black box" because it makes inferences and selects logits without us explicitly knowing what those logits are. We can (through intesive math) figure out WHY the AI chose those specific logits, but doing so is computationally expensive. In order to handle say 100 different car classifications an AI will need more than 100 "feature vectors" because otherwise it would try to fit any image of say a tree into 1/100 possible "cars". Since that isn't useful, AI designers use multiple tricks to stop the image classifier from classifying anything with a car. First they have at least one extra classification bucket for "other" data during training. Because we couldn't possibly show a model every possible object that isn't a car we also have to use other strategies. These other strategies include training an AI on "is this even a car or not" before we pass on the result to "what kind of car is this?". We also have confidence scoring, which is a score of how well an image of a tree would even match one of the 100 cars we want to classify. Since a tree image would be very different, the score would be low, and the AI would generate a low confidence score for its inference of "this tree image = that kind of car". We tell the program to instead discard any inferences with low confidence scores as "other" so if the AI is only 20% confident the source image contains a lamborghini it classifies it as "not a car".
Calculate the loss using a loss function to assess inference goodness. Many different loss functions exist. They vary ususally based on the type of model. For classification models Cross-Entropy loss is used. The formula for Cross entropy loss is: \( H(y,p) = -\sum_{}y*log(p) \) where \( H(y,p) )\ is the cross-entropy \( H \). The true label is \( y )\ which is our "corrective labelled image. The prediction our model has is \( p \) with the value of that being the probability or confidence of the prediction. The Summation \( \sum \) in the equation represents the sum over all the different "feature vector" categories meaning \( y*log(p) \) is calculated per feature vector and then we add them all together. The feature vector categories are a list of probabilities for each "class". For an image classifier of 100 cars with 1 bucket for "other" there would be a list of 101 probabilities (0.0 to 1.0). The "true" label is created by a human. For example, in our car clasifier, the "1.0" probability would appear exactly once in the list. If the image was a Lamborghini then the image would be labelled with this list and the single class that corresponded to "this is a Lamborghini" would have probability 1.0. All other probabilties would be 0. The models prediction is the same list but would contain probabilties based on the calculated weights. Probabilities would be float values ranging between 0 and 1. Assume the model predicted 0.2 for the Lamborghini class, then this means \( y = 1.0 \) and \( p = 0.2 \). For other classes \( y = 0.0 \) and \( p \) would equal whatever 0-1 probability the model computed. Adding these all together creates the loss.
Backpropagate by adjusting all parameters using a gradient descent function and the calulated loss from step 10. First we calculate the gradient of each weight. We consider each element in each filter matrix a "weight" for that layer. First we calculate the gradient for that layer. To do this the computer must first have recorded every operation for every weight in the form of a Directed Acyclic Graph. Directed meaning the data flows one way (forward propagation). Acyclic meaning the graph doesn't contain cycles (a graph of a neural network doesn't contain loops between layers). The graph keeps a "tape" of every operation Using the chain of operations the computer calculates the Calculus chain rule to backpropagate the weights from the final layer to the first layer. TODO not fully understanding how exactly the backpropagation happens. Next we compute the following: \( Wn = Wo - (n x G) \) where Wn is the new weight we want, Wo is the old weight, n is a custom learning curve we choose, and G is the calculated gradient. We subtract the gradient (times our learning rate factor) so that we go in the direction opposite the gradient (descend), hence the name gradient descent.