Tuesday, May 28, 2013

Computer Organization and Architecture

Computer Organization
The question ‘How does a computer work?’ is the concern of computer organization. Computer organization encompasses all physical aspects of computer systems. The way in which the various circuits and structural components come together to make up a fully functional computer system is the way the system is organized.

                                                                       
fig-1.1


Computer Architecture
Computer architecture is concerned with the structure and behavior of the computer system and refers to all the logical aspects of system implementation as seen through the eyes of a programmer. For example, in multithreaded programming, the implementation of mutexes and semaphores to restrict access to critical regions and to prevent race conditions among threads is an architectural concern.

fig-1.2
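To make the idea of a mutex guarding a critical region a little more concrete, here is a minimal sketch in Python. The function name run_counter and its parameters are made up for this illustration; the only real API used is Python's standard threading module.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    """Increment a shared counter from several threads, guarded by a mutex.

    Hypothetical helper for illustration only.
    """
    counter = 0
    lock = threading.Lock()  # the mutex protecting the critical region

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:        # only one thread at a time may enter
                counter += 1  # critical region: read, add, write back

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_counter())  # 4 threads x 10,000 increments each
```

Because every update happens while the lock is held, the final count always equals num_threads multiplied by increments. Without the lock, two threads could read the same old value and one of the updates could be lost.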

Therefore,
What does a computer do? = Architectural concern.

How does a computer work? = Organizational concern.

Taking automobiles as an example, making a car is, on a general level, a two-step process: 1) making a logical design of the car, and 2) making a physical implementation of that design. The designer has to decide on everything, from how to design the car for maximum environmental friendliness to what materials to use so that the car is cost effective and still looks good. All this falls under the category of architecture. When this design is actually implemented in a car manufacturing plant and a real car is built, we say that organization has taken place.


What are the benefits of studying computer architecture and organization?
Before delving into the technicalities, common sense tells us that as a user it is almost perfectly all right to operate a computer without understanding what the computer does internally to make ‘magical’ things happen on the screen, or how it does it.

However, if a computer science student (with a lack of knowledge of computer architecture and organization) were to write code that does not comply with the internal architecture and organization of his/her computer, then the computer would behave strangely, and in the end the student would have to take it to a service center and blindly rely on the technicians there to fix the problems. If this is the case, then the only difference between a user and a computer science student is that the latter knows how to display the words ‘Hello World’ on the computer screen in a couple of different programming languages.

Say, for example, a game developer were to design a game with the ‘frames per second’ property set well above 30 fps. The gameplay would become much faster, but CPU usage would rise dramatically, making it very difficult for the CPU to do much else while the game is running and slowing down the computer noticeably. This is because the ‘frames per second’ value specifies how many times the screen gets updated every second. If the value is too large, that means too many updates per second, and therefore more and more of the power and attention of the CPU has to be focused on running the game. If the value is too low, CPU usage falls remarkably, but the game becomes much slower. A value of about 30 therefore keeps the game reasonably fast without using up too much of the CPU. Without this knowledge, the game developer would unknowingly be making inefficient games that would struggle to make any commercial impact.
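The trade-off can be put into rough numbers. The sketch below is only a back-of-the-envelope model: the function names and the assumption that each frame costs a fixed amount of CPU work (10 ms in the usage example) are mine, chosen purely for illustration.

```python
def frame_budget_ms(fps):
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def cpu_share(work_ms_per_frame, fps):
    """Fraction of one CPU spent on the game, capped at 1.0.

    Assumes (hypothetically) that every frame costs the same amount of work.
    """
    return min(1.0, work_ms_per_frame / frame_budget_ms(fps))


if __name__ == "__main__":
    for fps in (15, 30, 60, 120):
        print(fps, "fps:", round(frame_budget_ms(fps), 2), "ms per frame,",
              "CPU share", round(cpu_share(10, fps), 2))
```

Under these assumptions, doubling the frame rate halves the time budget per frame and doubles the share of the CPU the game needs, until the CPU is fully occupied.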


What are the factors that prevent us from speeding up?

1) The fetch-execute cycle: The CPU generally fetches an instruction from memory and then executes it, one instruction after another. This whole process can be cumbersome if the computer does not implement a technique such as ‘pipelining’. This is because, in general, the fetch-execute cycle consists of three stages:

a)      Fetch
b)      Decode
c)      Execute

Without a pipelining system, if the CPU is already dealing with a particular instruction, it will only consider another instruction once it is completely done executing the current one; there is no chance of concurrency. This slows down the computer considerably, as each instruction has to wait for the previous one to finish first.
If, on the other hand, a pipelining system is implemented, the CPU can begin to fetch a new instruction while already decoding another. By the time it finishes executing the initial instruction, the second instruction will be ready to be executed immediately. In that way the CPU works on more than one instruction at a time, each in a different stage, and in the ideal case it completes one instruction every clock cycle instead of one every three.
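The difference can be counted in clock cycles. The following is a minimal sketch that assumes each of the three stages takes exactly one clock cycle and that nothing ever stalls the pipeline; the function names are made up for this example.

```python
STAGES = ("Fetch", "Decode", "Execute")


def cycles_without_pipeline(n):
    """Each instruction passes through all stages before the next one starts."""
    return n * len(STAGES)


def cycles_with_pipeline(n):
    """The first instruction fills the pipeline; after that one finishes per cycle."""
    return len(STAGES) + (n - 1) if n > 0 else 0


def pipeline_schedule(n):
    """For each clock cycle, map instruction number -> stage it occupies."""
    schedule = []
    for cycle in range(cycles_with_pipeline(n)):
        active = {}
        for instr in range(n):
            stage_index = cycle - instr  # instruction i enters Fetch at cycle i
            if 0 <= stage_index < len(STAGES):
                active[instr] = STAGES[stage_index]
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    print(cycles_without_pipeline(5), cycles_with_pipeline(5))
    for cycle, active in enumerate(pipeline_schedule(3)):
        print("cycle", cycle, active)
```

For five instructions this idealized model gives 15 cycles without pipelining and 7 with it. Real pipelines lose some of that gain to stalls, for example when one instruction needs the result of the one just before it.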

2) Hardware limitations: A simple CPU traditionally consists of one ALU (Arithmetic Logic Unit), for example, and this restricts it to handling at most one instruction per clock cycle. With the rapid advancement in technology and the resultant rise in computing demands, this is clearly not fast enough. A possible solution is therefore to take the architecture to the superscalar level.

A superscalar architecture is one that contains two or more functional units and can therefore carry out two or more instructions per clock cycle. A CPU could be made with two ALUs, for example.

fig-1.3

The figure above shows the micro-architecture of a processor. It consists of two ALUs, an on-board FPU (Floating Point Unit) with its own set of floating point registers, and a BPU (Branch Prediction Unit) for handling branch instructions. The cache memory on top allows the processor to fetch instructions much faster. Clearly such a processor is built for speed. With its four execution units it could execute up to four instructions at once, provided the instructions are independent of one another and need different units.
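A toy model shows why four execution units give "up to" four instructions per cycle rather than always four. The sketch below assumes the unit mix described for the figure (two ALUs, one FPU, one BPU), issues instructions strictly in order, and ignores data dependencies entirely; the names UNITS and cycles_to_issue are invented for this illustration.

```python
from collections import Counter

# Execution units available every clock cycle (assumed from the figure's description).
UNITS = {"alu": 2, "fpu": 1, "bpu": 1}


def cycles_to_issue(instructions, units=UNITS):
    """Count cycles needed to issue the instructions in program order.

    Each cycle, consecutive instructions are issued until one of them needs
    a unit that is already taken in that cycle. Data dependencies are ignored.
    """
    for kind in instructions:
        if units.get(kind, 0) <= 0:
            raise ValueError(f"no execution unit for {kind!r}")
    cycles = 0
    i = 0
    while i < len(instructions):
        free = Counter(units)  # all units are free at the start of a cycle
        while i < len(instructions) and free[instructions[i]] > 0:
            free[instructions[i]] -= 1
            i += 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    print(cycles_to_issue(["alu", "alu", "fpu", "bpu"]))  # ideal mix
    print(cycles_to_issue(["alu", "alu", "alu", "alu"]))  # only ALU work
```

With an ideal mix, all four instructions go out in a single cycle. Four ALU instructions take two cycles, because only two ALUs exist, and two floating point instructions in a row also take two cycles.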

3) Parallelism: A typical computer with a single CPU can be tweaked to perform faster, but that performance will always be limited, because, after all, how much can you really get out of a single processor? Now imagine having more than one processor in a particular computer system.

Such systems do exist in the form of multiprocessors and parallel computers. They usually have between two and a few thousand interconnected processors, each either having its own private memory or sharing a common memory. The obvious advantage of such a system is that any task it undertakes can be divided amongst the processors by allocating a certain part of the task to each one.

For example, if a calculation were to take ten hours to complete on a conventional computer with one CPU, it would take far less time on a multiprocessor or parallel computer. That is, if the calculation could be split up into ten independent chunks, each requiring one hour to complete, then on a multiprocessor system with ten processors the entire calculation would take only one hour, because the different chunks would run in parallel.
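The arithmetic generalizes to other processor counts. This is a minimal sketch that assumes the chunks are fully independent, equally long, and free of any communication overhead; parallel_hours is a hypothetical helper name.

```python
import math


def parallel_hours(chunks, hours_per_chunk, processors):
    """Wall-clock time when equal, independent chunks are spread over processors.

    Chunks run in rounds: each round, every processor takes one chunk.
    """
    rounds = math.ceil(chunks / processors)
    return rounds * hours_per_chunk


if __name__ == "__main__":
    for p in (1, 2, 4, 10, 20):
        print(p, "processors:", parallel_hours(10, 1, p), "hours")
```

One processor needs ten hours and ten processors need one hour, as in the example above. Four processors need three hours, not two and a half, because the last round keeps only two of them busy, and twenty processors are no faster than ten because there are only ten chunks to hand out.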
  
4) Clock speed and the Von Neumann bottleneck: The clock speed of a computer tells us the rate at which the CPU operates. Some of the first microcomputers had clock speeds in the range of MHz (Megahertz). Nowadays we have speeds in the range of GHz (Gigahertz).

One possible way to achieve greater speed would be to increase the clock speed. However, it turns out that increasing the clock speed alone does not guarantee noticeable gains in speed and performance. This is because the speed of the processor is limited by the rate at which it can retrieve data and instructions from memory.

Suppose a particular task takes ten units of time to complete: eight units are spent waiting on memory and the remaining two on processing. Doubling the clock speed without improving the memory access time reduces the processing time from two units to one, while the memory wait stays at eight. The overall time therefore drops only from ten units to nine, a gain of just ten percent.
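The same calculation can be written as a small function, which makes it easy to try other clock multipliers. The function names below are made up for this sketch, and the model assumes that memory wait time is completely unaffected by the clock speed.

```python
def total_time(memory_units, processing_units, clock_multiplier):
    """Total task time when only the processing part scales with clock speed."""
    return memory_units + processing_units / clock_multiplier


def time_saved_fraction(memory_units, processing_units, clock_multiplier):
    """Fraction of the original task time saved by the faster clock."""
    before = memory_units + processing_units
    after = total_time(memory_units, processing_units, clock_multiplier)
    return (before - after) / before


if __name__ == "__main__":
    for multiplier in (1, 2, 4, 100):
        print(multiplier, "x clock:",
              total_time(8, 2, multiplier), "units,",
              round(100 * time_saved_fraction(8, 2, multiplier), 1), "% saved")
```

With the numbers from the example, doubling the clock gives nine units and a ten percent saving. Even a clock one hundred times faster cannot push the total below the eight units spent waiting on memory.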

This limitation, caused by a mismatch in speed between the CPU and memory, is known as the Von Neumann bottleneck. One simple solution to this problem is to install a cache memory between the CPU and main memory, which effectively reduces the average memory access time. Cache memory has a faster access time than main memory, but it is more expensive and comes with a lower storage capacity.
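How much a cache helps depends on how often the data is found in it. A common way to estimate this is the average memory access time: the cache hit time plus the miss rate multiplied by the extra cost of going to main memory. The sketch below uses that formula; the specific numbers in the usage example (one unit for a cache hit, eight extra units for a miss) are hypothetical and chosen only to line up with the example above.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    # Hypothetical figures: 1 unit per cache hit, 8 extra units per miss.
    for miss_rate in (1.0, 0.5, 0.1, 0.01):
        print("miss rate", miss_rate, "->",
              round(average_access_time(1, miss_rate, 8), 2), "units")
```

With these assumed figures, a cache that finds the data nine times out of ten brings the average access time down to about 1.8 units. A cache that always misses is slightly worse than having no cache at all, since every access pays for the cache lookup as well.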


5) Branch prediction: The processor acts like a psychic and predicts which branch, or group of instructions, it will have to deal with next by looking at the instruction code fetched from memory. If the processor guesses correctly most of the time, it can fetch the correct instructions beforehand and buffer them, which keeps the processor busy most of the time. There are various algorithms for implementing branch prediction, some very complex, often predicting multiple branches ahead. All of this is aimed at keeping the CPU supplied with useful work and therefore optimizing speed and performance.
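One of the simpler prediction schemes is the two-bit saturating counter. It is shown here only as an illustration of the idea, not as the scheme any particular processor uses. The counter holds a value from 0 to 3: values 2 and 3 predict "taken", values 0 and 1 predict "not taken", and each actual outcome nudges the counter one step up or down. The function name and the starting state are assumptions made for this sketch.

```python
def simulate_two_bit_predictor(outcomes, initial_state=2):
    """Return how many branch outcomes a two-bit saturating counter predicts.

    outcomes: iterable of booleans, True meaning the branch was taken.
    States 0-1 predict not taken, states 2-3 predict taken.
    """
    state = initial_state
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


if __name__ == "__main__":
    # A loop branch: taken nine times, then not taken once when the loop exits.
    loop = [True] * 9 + [False]
    print(simulate_two_bit_predictor(loop), "of", len(loop), "predicted correctly")
```

For a loop branch that is taken nine times and then falls through once, the predictor gets nine out of ten right. Because a single surprise only moves the counter one step, it still predicts "taken" the next time the same loop runs.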