Sunday, April 17, 2016

What happens when you power on a computer?

I am developing two operating systems in parallel - Test2OS and PLPOS. Both are toy OSes. Test2OS targets the x86 architecture, whereas PLPOS targets the PLP architecture, a MIPS-based architecture. The main purpose of Test2OS is simply to learn operating systems: I would like to implement all the basic OS concepts in Test2OS, and PLPOS will then be developed based on that knowledge. Test2OS has already been started, and its code will come mostly from different kernel development tutorial websites; PLPOS development will start soon. So as I learn and develop, I would like to present these concepts in different blog posts. Okay, let's get back to this one.

It's a simple question: "What really happens the moment you press the power button on a computer?" Well, you may say our operating system (Windows, Ubuntu and the like) will start. True, but how does that start? How does your processor, say an Intel i7, know that the Windows operating system is installed on your hard disk and that it has to start running it?

Before our operating system starts running, there are two crucial steps - BIOS and bootloader. After these two steps comes our operating system. The BIOS loads the bootloader, and the bootloader in turn loads the kernel or operating system.

This post is all about the BIOS - Basic Input Output System. The BIOS provides the first set of instructions to be executed by the CPU. As this is the first thing to run, it cannot reside in volatile memory like RAM. Earlier systems held the BIOS in ROM; nowadays it is stored in flash memory. All the settings related to the BIOS are stored in a CMOS chip, and all the addresses referred to by the BIOS are predefined. So when the power button of the system is pressed, a signal is sent to the motherboard. If the computer is connected to an active power supply, the motherboard forwards that signal to the Power Supply Unit (PSU). The PSU starts providing power to the various hardware devices. Once the PSU is satisfied that sufficient, stable power is available to each device, it sends a Power_Good signal to the BIOS. Now the BIOS starts its work.

The BIOS's main job is to find the bootable device. Apart from this main task, it has certain subtasks:
  1. Do a hardware test - check that all the hardware devices are working fine.
  2. Initialize different registers.
  3. Activate the BIOSes of other devices, like the graphics processor's BIOS.
  4. Create the interrupt vector table.
  5. Manage some settings related to the hard disk, clocks, etc.
So the BIOS has received Power_Good. The first thing it does is the hardware test - the Power On Self Test (POST). Some of the tests done are RAM read/write tests, a keyboard test, etc. POST is skipped if the system is doing a warm boot: if the value 0x1234 is present in the reset-flag word at address 0040:0072, it does a warm boot, else it does a cold boot. Usually, powering on after a shutdown results in a cold boot; just restarting the machine results in a warm boot.

Once POST is done and everything is verified to be working fine, the BIOS creates the interrupt vector table so that we can talk to different hardware devices using the interrupts listed in this table. Earlier operating systems used the interrupts given by the BIOS and let the BIOS do the talking with the hardware, but nowadays the OS talks to the hardware directly. The next job of the BIOS is to initialize registers like AX, BX, DS, CS, etc. Usually everything is set to 0.

The BIOS program itself is always located in a reserved memory area - the upper 64 KB of the first megabyte of system memory, so the BIOS entry point is at 0xFFFF0. The code segment register is initialized with selector 0xF000, base register 0xFFFF0000 and limit 0xFFFF. Effectively, the first instruction is fetched from 4 GB - 16 bytes = 0xFFFFFFF0, which the chipset maps to the system ROM's mirror address 0x000FFFF0. Okay, this is very confusing, I know, and this is the best I could come up with to explain it; I guess I have quoted directly from one of the references.

Next it checks its CMOS settings to see if there are any other BIOSes in the system, like the BIOS of the graphics processing unit, and runs those BIOSes one by one. Once all this is done, it raises interrupt 0x19. By now the BIOS has created the interrupt vector table in RAM from address 0 to 0x3FF (256 entries of 4 bytes each); the area from 0x400 to 0x4FF is reserved for the BIOS's own data. The job of interrupt 0x19 is to locate the bootable device. In the CMOS settings, we would have set an order for checking bootable devices - for example, first the hard disk, then the CD-ROM and finally the floppy disk. As soon as the BIOS finds a bootable device it stops the search; that's why the order is important. Suppose you have two bootable devices, hard disk and CD-ROM, and the boot order is CD-ROM then hard disk - it will always boot from the CD-ROM and never reach the hard disk.
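Just to make the table's size concrete, here is a tiny C sketch of the real-mode IVT layout - an illustration of the entry format only (256 four-byte offset:segment pairs), not anything the BIOS actually runs:

#include <stdio.h>
#include <stdint.h>

/* Real-mode interrupt vector: each of the 256 entries is a 4-byte far
   pointer, stored as offset first, then segment. */
struct IvtEntry {
    uint16_t offset;    /* offset of the handler within its segment */
    uint16_t segment;   /* segment of the interrupt handler         */
};

int main(void)
{
    /* 256 entries * 4 bytes = 0x400 bytes: physical addresses 0x0000-0x03FF. */
    printf("IVT size: 0x%lX bytes\n",
           (unsigned long)(256 * sizeof(struct IvtEntry)));
    return 0;
}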

The BIOS does not care about the content of the bootable device. It first checks whether it is able to read the first chunk (512 bytes - the first sector) from the bootable device. If so, it copies that chunk to RAM at address 0x0000:0x7C00. Usually it also checks for the boot signature, that is, whether the last two bytes of the sector (the 511th and 512th bytes) are 0x55 and 0xAA. Once the sector is copied and the boot signature verified, the BIOS jumps to the address 0x0000:0x7C00 to start booting the operating system.
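To make the signature check concrete, here is a small C sketch that reads the first sector of a disk image and validates the signature. The file name disk.img is made up, and a real BIOS does this in firmware rather than through stdio, but the check itself is the same:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t sector[512];
    FILE *disk = fopen("disk.img", "rb");   /* hypothetical disk image file */

    if (disk == NULL) {
        printf("could not open the disk image\n");
        return 1;
    }
    if (fread(sector, 1, sizeof(sector), disk) != sizeof(sector)) {
        printf("could not read the first sector\n");
        fclose(disk);
        return 1;
    }
    fclose(disk);

    /* Boot signature check: the last two bytes must be 0x55 and 0xAA. */
    if (sector[510] == 0x55 && sector[511] == 0xAA)
        printf("bootable - BIOS would copy this sector to 0x0000:0x7C00\n");
    else
        printf("no boot signature - BIOS would try the next device\n");

    return 0;
}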

When you power on the system, you won't realize that the BIOS is doing this much work in the background - everything happens so fast. Now we have identified our bootable device and loaded the bootloader code into RAM. So the next post will be about the booting of the operating system; till then, read the references of this post.

Let me say it again: I am still learning and I may make a lot of mistakes in these posts. If there are mistakes, point them out so that I can learn and correct them. If it's good, then comment, +1 it or share it.

References:
Much of this content I read from the following links; some of what I have written may have come directly from these articles.




Wednesday, March 9, 2016

Behind the Function Call - Disassembly of a function call in C with Microsoft Visual Studio

Just out of curiosity, I wanted to know how a function call is actually implemented. I have worked on PLP, where we write assembly programs for bare hardware; there we simulate function calls using special instructions like jal/jr. Similarly, in MASM, when calling a function with parameters, we push the parameters onto the stack in an agreed-upon order and then call the function; inside the function, we access those parameters from the stack.

Now, C is a high-level programming language, and all this stack management is done by the compiler when it generates object code. I wanted to see how it is done, so in this post I will explain what I learned.

For a function call, there are 3 parts. 
  1. Before function call
  2. During function call
  3. After function return


Let's go one by one. My main function for this experiment is given below.

int main(int argc, char **argv)
{
    int nReturnValue = 0;

    int x = 10;
    int y = 10;

    int value = Add(x, y);

    x = 20;
    y = 30;

    value = AddEmpty();

    x = 40;
    y = 50;

    AddEmptyVoid();

    return nReturnValue;
}


I will explain 3 function calls - Add [has return value, has parameters], AddEmpty [has return value, but no parameters] and AddEmptyVoid [no return value, no parameters].

a. Add Function [int Add(int x, int y)]

int Add(int x, int y)
{
    int value = x + y;

    return value;
}



So when I call a function in C, the compiler internally translates it into code like that shown in the above image. All the instructions before the call Add (012A1000h) fall under before function call; the call instruction and some others (which I will show afterwards) come under during function call; finally, all the instructions after the call come under the after function call category.

Before Function Call: This function has 2 parameters - x and y. I push those values onto the stack before going into the function, so that the function can use them from the stack. As you can see, this happens from right to left, that is, first y and then x. Suppose I have 3 parameters, like SomeFunc(x, y, z); then it would be push z, push y, push x.

During Function Call: call stores the return address on the stack - that is, whatever value is present in EIP is pushed onto the stack before jumping to the function. Here EIP will be pointing to the add esp,8 instruction, as shown in the above image. Before starting the actual computation of the function, some extra work has to be done, which is shown in the image below.

That's a lot of work. Let's go one by one. EBP is the base pointer; in every function, it holds your frame pointer. The frame pointer is nothing but the starting point of the function's frame in the stack: the stack pointer keeps varying as the function proceeds, but the frame pointer is constant for that function. When we come inside the Add function, EBP is still pointing to the old function's frame pointer - in this case, the main function's. So the first job is to back up the old function's frame pointer (push ebp) and make EBP point to this function's frame: ESP is pointing to the top of the stack, which is the beginning of our function's frame, so copy it to EBP (mov ebp, esp).

Once the function's frame pointer is established, local variables are allocated. Since the stack grows downwards, space is allocated by subtracting from the stack pointer (sub esp, 0CCh). If I have 0 local variables, by default it subtracts 0xC0, that is 192 bytes - why 192 bytes, I am not sure. In this function, I have one local variable, value, of type int; here again, it reserves 12 bytes for 1 integer. Once this is done, the old function's registers relating to string/array operations - ebx, esi and edi - are backed up by pushing them onto the stack.

The next 4 instructions initialize the local variable area, storing 0xCCCCCCCC into each dword. For local variables we allocated 0xCC bytes; divide that by 4 (we are dealing with dwords, which are 4 bytes) and we get 0x33. The rep stos instruction uses the registers ecx, eax and edi: it loops ecx times, and on each iteration stores the value in eax at the address pointed to by edi, then increments edi by 4. So lea edi, [ebp-0CCh] points edi at the beginning of the local variable area of the stack, mov ecx, 33h loads ecx with the number of dwords to loop over, and mov eax, 0CCCCCCCCh loads the fill value. The last instruction, rep stos dword ptr es:[edi], loops ecx times and initializes each dword pointed to by edi with the value in eax.

Finally, all the necessary setup is done, and the actual function computation starts.
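If it helps, the effect of those four instructions can be modeled in C. This is only a sketch of what rep stos accomplishes, with made-up names; it is not the compiler's actual code:

#include <stddef.h>
#include <stdint.h>

/* A C model of the debug-build prologue's fill loop: dest plays the role of
   edi, count_dwords of ecx, and 0xCCCCCCCC is the value loaded into eax. */
static void debug_fill(uint32_t *dest, size_t count_dwords)
{
    for (size_t i = 0; i < count_dwords; i++)  /* rep: loop ecx times        */
        dest[i] = 0xCCCCCCCCu;                 /* stos: store eax, edi += 4  */
}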

After Function Return: This comes after the return statement of the function. Similar to during function call, some of the work is done by the called function (Add) and some by the caller (main). Basically, we need to store the return value, if any, and then clean up the stack. The whole idea is to return the stack to the same state it was in before the function call.

The return value is passed via the eax register, so the value variable is copied to eax (mov eax, dword ptr [value]). Next, the registers that were pushed at the beginning of the function are restored. Remember, pops must happen in reverse order - the registers pushed last are popped first: so first edi, then esi and finally ebx. Now we unwind back to the state at function entry. EBP holds the frame pointer, which points to the beginning of this function's frame, so copy the frame pointer to the stack pointer (mov esp, ebp). Remember, just before creating this function's frame pointer, we had pushed the old function's frame pointer onto the stack; as we are going back to the old function, restore its frame pointer (pop ebp).

Finally comes ret. This loads EIP with the return address: remember, call pushed the return address onto the stack, and ret pops it back into EIP. Now control comes back to the main function, pointing at the add esp, 8 instruction. For calling Add we had pushed 2 integers onto the stack, so we need to discard those values; we don't use them anymore, so we just shrink the stack (add esp, 8). The return value of the function is present in eax, and it is copied to the local variable (mov dword ptr [value], eax).
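That add esp, 8 in the caller is a property of the cdecl calling convention, which is the Visual C++ default in this post's disassembly. As a hedged side note (these function names are mine, not from the disassembly above), the same function declared stdcall would clean up its own arguments with ret 8 instead of leaving it to the caller:

/* cdecl (the default): the caller removes the arguments, hence add esp, 8
   after the call. stdcall: the callee removes them, compiling to ret 8.   */
int __cdecl   AddCdecl(int x, int y)   { return x + y; }
int __stdcall AddStdcall(int x, int y) { return x + y; }

(__stdcall is meaningful only for 32-bit x86 builds; on x64 both declarations compile to the same convention.)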

b. AddEmpty Function [int AddEmpty()]

int AddEmpty()
{
    int x = 10;
    int y = 20;

    return x + y;
}

Now, this is similar to the previous Add function call, but with small changes. This function does not have any parameters, so before the call instruction we need not push any values onto the stack. Also, after the call we need not correct the stack top, as we have not pushed any parameters.

In the AddEmpty function, we have 2 local variables, hence the local variable storage is 0xC0 + 0x18 = 0xD8 bytes (12 bytes per int, as before). So while initializing the local variables, it loops 0xD8/4 = 0x36 times. The rest is the same as the previous function call.


The return of the function is the same as the previous one.


c. AddEmptyVoid [void AddEmptyVoid()]

void AddEmptyVoid()
{

}

As you can see, this function neither has parameters nor does it return any value. The caller directly invokes it with the call instruction. Also, after the function returns, nothing is copied from the eax register, as the function does not return anything.


As it has no local variables, the stack allocation for locals is the default 0xC0 bytes. For the local variable initialization, it loops 0xC0/4 = 0x30 times. The rest is the same as the previous function calls.



This explains only x86; things are different on x64. Currently I am running late :-p so I will update the post some later day.

Please let me know if any errors are present in the explanation or if I have made any wrong statements.


Following are the references which helped me understand the concepts.





Wednesday, March 2, 2016

Hadoop - Architecture

I started learning about distributed file systems. Hadoop's file system is based on the Google File System. It is targeted at data-intensive applications, is scalable and fault tolerant, and is designed to run on cheap commodity computers, which are bound to fail - either the whole machine crashes or its disks experience intermittent failures. Also, when we connect a large set of computers, some of them may get disconnected due to network failures. So while in normal file systems failure is just an exception case, in the Google File System and Hadoop it is the norm: these file systems are designed with failure as a certainty.

Obviously your question would be: how many computers are usually present in this distributed environment? Hmmm, I guess numbers in the 1000s. These can be in the same data center, or they might be spread across different data centers altogether.

Next question: what are we going to process on this system of computers? If we have this many computers, it should be super complex computations like rocket science, weather forecasting, etc., right? Well, you can use it for that, but even simple problems need a big set of machines. Let me give you an example. You have a simple text file containing some text; let us assume its size is some 16 KB. Now you write a C program to count the word occurrences in that file. How much time does it take to execute? Hardly seconds. Now let me give you a 16 MB file - still it completes in seconds. If I increase it to a 16 GB file, it might drag into minutes. What if it is 16 TB? Now it might run for hours before completing. How nice it would be if I could somehow use parallelism in counting the words. You can do that with a normal C program too, but the overhead of synchronization costs a lot of programming time, and with synchronization come deadlocks and other monsters. And what kind of parallelism are you really looking at here - threads or processes? Both are confined to the same machine.
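For reference, the single-machine baseline I have in mind is something as simple as the sketch below - it counts whitespace-separated words in one file, with no parallelism at all (input.txt is a placeholder name; a per-word frequency count would just add a lookup table on top of this):

#include <stdio.h>
#include <ctype.h>

/* Minimal single-machine word count: counts whitespace-separated words. */
int main(void)
{
    FILE *fp = fopen("input.txt", "rb");   /* hypothetical input file */
    long words = 0;
    int ch, in_word = 0;

    if (fp == NULL)
        return 1;

    while ((ch = fgetc(fp)) != EOF) {
        if (isspace(ch))
            in_word = 0;
        else if (!in_word) {        /* first character of a new word */
            in_word = 1;
            words++;
        }
    }
    fclose(fp);

    printf("%ld words\n", words);
    return 0;
}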

Now enter Hadoop. Hadoop provides this distributed environment for handling huge files, which lets you concentrate on the logic of the program rather than worrying about how to make the program work at scale. Hadoop combined with MapReduce helps you write a simple program (or programs) to do the word count on really huge files. Let me remind you, word counting is a simple program; suppose you want to implement a web crawler, a natural language processing algorithm or some complex image processing algorithm - that is when you will really utilize Hadoop's distributed environment. The best thing is they are saying: take a bunch of normal, cheap Linux machines, combine them, and you get a beast out of it.

Hadoop is a client-server model. There is a master [namenode], a set of workers [datanodes], and then there are clients. There is a lot of theory on why this architecture was chosen, what benefits it gives, and why each component handles its responsibilities in a particular way; I would recommend reading the Google File System paper to understand it. In a future post, I will go through GFS.

In Hadoop, a given file is divided into fixed-size blocks, usually 64 MB each. The client asks the namenode to add the file to HDFS (the Hadoop Distributed File System); the namenode divides the file into blocks and asks datanodes to store those blocks. Each block is then copied/replicated to 3 different datanodes - this is one way of making the system more reliable - so a single file may be spread across multiple datanodes. The number of replicas can be configured.

So the namenode tracks the file system namespace and the file mapping, that is, the file-to-blocks mapping, plus where exactly those blocks live - in which datanodes each block exists. All this information is kept in memory. Now, obviously, if the namenode fails, everything is lost, right? And what was the need to keep it in memory? Well, it is fast. To avoid losing the data, there is a mechanism: there is a base image of the file system, the FsImage, and a log file, the EditLog. Each time a modification happens to the file system - addition of new files, deletion of files, changes to files - an entry is made in the EditLog. So in case of a namenode failure, when the namenode restarts, it uses the FsImage as the base to reconstruct the file system, then reads the EditLog and replays those changes onto that file system. Once everything is updated, it writes the result back to the FsImage and truncates the EditLog so that fresh changes can be recorded.

So the datanodes actually hold the data; each block is stored as a local file. To read or write a file's blocks, the client asks the namenode whom it should contact, and the namenode gives a list of datanodes. The client then contacts the datanodes directly for the actual data transfer - by doing this, we reduce the network traffic to the namenode. The namenode keeps track of datanodes by means of heartbeats: every now and then, each datanode sends the namenode a report - how many blocks it has, whether there are any disk failures, etc. If the namenode does not receive a message from a datanode for a while, it assumes that datanode has crashed and starts replicating the data present on it to other datanodes.


I would recommend going through this link http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Introduction to learn more about the architecture of Hadoop.

Now you might think you need at least 3 machines - a client machine, a namenode machine and a datanode machine. But we can configure a single machine to be both datanode and namenode, and run our client applications on the same machine too.

Well, I am trying to install Hadoop on a single machine. I will update with screenshots once done; currently I am facing some problems.

In future posts, I will go through MapReduce, the Google File System, Project Voldemort and Amazon's Dynamo.

Sunday, February 7, 2016

Heap Sort

This sorting algorithm has 3 main parts.

  1. Build Heap
  2. Swap 1st element and last element, Decrease Size by 1
  3. Heapify
You might be wondering now: what the hell is a heap?!!! A heap is a binary tree representation in which a parent element always holds either the highest or the lowest element among itself and its children - the left child and the right child. This binary tree is represented in an array. If each parent holds the highest element, it is a max heap; if each parent holds the smallest element, it is a min heap.

Below is the image of the heap representation in tree form and also as an array.



As you can see in the tree, it's a max heap - every parent has the highest value among (parent, left child, right child). To convert it into an array: the first element is the root, followed by its immediate left child and then its immediate right child, followed by the left child's children and then the right child's children.
So if a parent is at index i, its left child is at 2*i + 1 and its right child is at 2*i + 2. Conversely, given a child at index j, its parent is at (j - 1)/2.
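The code below uses LeftChildHeap and RightChildHeap, which are not shown in the post; here is a minimal sketch of what they (and the error codes used below) presumably look like, following the 0-based formulas above - the ERR_ values are assumed, as the originals aren't shown:

/* Index helpers for a 0-based array heap, per the formulas above. */
int LeftChildHeap(int nIndex)  { return 2 * nIndex + 1; }
int RightChildHeap(int nIndex) { return 2 * nIndex + 2; }
int ParentHeap(int nIndex)     { return (nIndex - 1) / 2; }

/* Assumed error codes; the post does not show their definitions. */
#define ERR_NULLINPUTARRAY   -1
#define ERR_INVALIDARRAYSIZE -2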

First step: given an array, we need to transform it into a heap. Below is the code for doing it.

int BuildMaxHeap(int* pIntArray, int nSize)
{
    int nReturnValue = 0;

    if (pIntArray == 0)
    {
        return ERR_NULLINPUTARRAY;
    }

    if (nSize <= 0)
    {
        return ERR_INVALIDARRAYSIZE;
    }

    /* Heapify every non-leaf node, from the last parent up to the root. */
    for (int i = nSize / 2; i >= 0; i--)
    {
        MaxHeapify(pIntArray, nSize, i);
    }

    return nReturnValue;
}

We start from the last parents - the parents just before the leaves - and check whether each satisfies the max heap property. If not, we max-heapify it. Once it is heapified, we move to the next parent and repeat the procedure until the root.

int MaxHeapify(int* pIntArray, int nSize, int nIndex)
{
    int nReturnValue = 0;
    int nLeftIndex = LeftChildHeap(nIndex);
    int nRightIndex = RightChildHeap(nIndex);
    int nLargestIndex = nIndex;
    int nTemp = 0;

    /* Find the largest among parent, left child and right child. */
    if (nLeftIndex < nSize && pIntArray[nLeftIndex] > pIntArray[nLargestIndex])
    {
        nLargestIndex = nLeftIndex;
    }

    if (nRightIndex < nSize && pIntArray[nRightIndex] > pIntArray[nLargestIndex])
    {
        nLargestIndex = nRightIndex;
    }

    if (nLargestIndex != nIndex)
    {
        /* Swap the parent with the larger child, then fix up that subtree. */
        nTemp = pIntArray[nLargestIndex];
        pIntArray[nLargestIndex] = pIntArray[nIndex];
        pIntArray[nIndex] = nTemp;

        MaxHeapify(pIntArray, nSize, nLargestIndex);
    }

    return nReturnValue;
}

So now, what is this max heapify? It checks which among the parent, left child and right child holds the largest element, and swaps the largest with the parent. Suppose the left child had the largest: then the parent and left child elements are swapped. Once swapped, we max-heapify the subtree rooted at the exchanged position - that is, in our case, we max-heapify the new left child's children, treating it as the root.

Once you are done with this, you have the heapified array. By the max heap property, the root (0th element of the array) holds the highest element in the array/heap. To get the sorted array: swap the 0th element with the last element of the array, decrease the size of the heap by one, and heapify the array from the root. When you do this, the next highest element comes to the root of the heap. Repeat the process until the size becomes 1.

int HeapSortIntegers(int* pIntArray, int nSize)
{
    int nReturnValue = 0;
    int nTemp = 0;

    if (pIntArray == 0)
    {
        return ERR_NULLINPUTARRAY;
    }

    if (nSize <= 0)
    {
        return ERR_INVALIDARRAYSIZE;
    }

    BuildMaxHeap(pIntArray, nSize);

    /* Move the current maximum to the end, shrink the heap, re-heapify. */
    for (int i = nSize - 1, j = 1; i >= 1; i--, j++)
    {
        nTemp = pIntArray[0];
        pIntArray[0] = pIntArray[i];
        pIntArray[i] = nTemp;

        MaxHeapify(pIntArray, nSize - j, 0);
    }

    return nReturnValue;
}

This looks so complicated, right? Does it have any benefit? Well, yes: its runtime is O(nlogn), its average case equals its worst case, and it sorts in place.

Wow, then this should be better than quicksort, as quicksort takes O(n^2) in its worst-case scenario?
Well, yes and no. Yes, because heapsort is always O(nlogn). No because, given an array for which quicksort runs in O(nlogn), if we execute both heapsort and quicksort on it, quicksort will run faster, as heapsort has higher constants in its runtime.

The implementation is based on Chapter 6 - Heapsort of Introduction to Algorithms, 3rd Edition [Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein].

Saturday, February 6, 2016

Quick Sort

This algorithm is quite tricky in terms of flow or concept. Again, this algorithm is based on the divide and conquer technique. In merge sort, the importance is in combining the divided lists of elements, whereas dividing is very simple; in quicksort it's the opposite - the importance is in dividing, and there is no work in combining. In every iteration, we try to find one element's correct position, such that all the elements before that position are less than that element and all the elements after it are greater. Well, you may say that's just the concept of ascending order - true, but in the first iteration only this one element is brought to its correct position; all the elements to its right and left may still be out of place. The position of this element forms the partition point. Once I get this index, I rerun my quicksort algorithm from the lowest position to index - 1, and similarly from index + 1 to the highest position.

Well, now we have some idea of quicksort, so what is its complexity?
Hmmm, its worst case is O(n^2).

What?? Then why do we need this? We could have just gone with merge sort, right?
Well, first, this algorithm sorts in place, whereas merge sort needs extra space, remember... Also, its average runtime is theta(nlogn), and it is better than merge sort in most cases.

Oh! Then when does the worst case occur for quicksort?
When the array is already sorted. But we can use random picking of the pivot: before finding the pivot, we choose a random index between low and high and swap that element with the highest-index element. This more or less forces it to run in its average-case runtime.

Hmmm, good. Now let's see the code.

The starting point is the recursive function, which keeps calling itself until all the elements are sorted.

int RandomizedQuickSort(int* pIntArray, int nlow, int nhigh)
{
    int nReturnValue = 0;
    int nKey = -1;

    if (nlow < nhigh)
    {
        nKey = RandomizedPartition(pIntArray, nlow, nhigh);
        RandomizedQuickSort(pIntArray, nlow, nKey - 1);
        RandomizedQuickSort(pIntArray, nKey + 1, nhigh);
    }

    return nReturnValue;
}

To force the algorithm to run in average time, we need to randomize the pivot selection. So here is the code for RandomizedPartition, which randomly selects an index and swaps its element with the highest-index element.

#include <stdlib.h>   /* rand */

int RandomizedPartition(int* pIntArray, int nlow, int nhigh)
{
    int nReturnValue = 0;
    int nDiff = nhigh - nlow;
    int nTemp = 0;

    /* Pick a random index in [nlow, nhigh]. Note: seed the generator once
       at program start (e.g. srand(time(0)) in main); re-seeding on every
       call would keep producing the same pivots within the same second. */
    nDiff = rand() % (nDiff + 1);

    nTemp = pIntArray[nhigh];
    pIntArray[nhigh] = pIntArray[nlow + nDiff];
    pIntArray[nlow + nDiff] = nTemp;

    nReturnValue = PartitionQuickSort(pIntArray, nlow, nhigh);

    return nReturnValue;
}

Finally, here is the partition routine itself.

int PartitionQuickSort(int* pIntArray, int nlow, int nhigh)
{
    int nCompareElement = pIntArray[nhigh];   /* the pivot */
    int nTemp = 0;
    int i = nlow - 1;

    for (int j = nlow; j < nhigh; j++)
    {
        if (pIntArray[j] <= nCompareElement)
        {
            /* Grow the "less than or equal" region: swap elements i and j. */
            i++;
            nTemp = pIntArray[i];
            pIntArray[i] = pIntArray[j];
            pIntArray[j] = nTemp;
        }
    }

    /* Put the pivot into its final position, i + 1. */
    nTemp = pIntArray[i + 1];
    pIntArray[i + 1] = pIntArray[nhigh];
    pIntArray[nhigh] = nTemp;

    return i + 1;
}

Basically, I am trying to find the correct position of the last element in the given array - a position such that all elements less than it come before it and all elements greater come after it; the order of those surrounding elements is not a concern. So how do I find that position? I have two indices, i and j: j is for iteration, i is for position swapping. Each element, starting from the lowest index, is compared with the highest-index element (the pivot). If the element is greater, just move ahead without doing anything. If it is smaller or equal, increment i and swap the elements at indices i and j - basically, you are pushing the greater elements to the other side. Keep doing this until j reaches highest - 1. Now i + 1 is the correct position for the pivot, so swap the pivot with the element at index i + 1. Finally, i + 1 is your partition point.
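To tie it together, here is a minimal driver for the functions above - a sketch assuming they are in scope; note the single srand call at startup, per the comment in RandomizedPartition:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int arr[] = { 9, 3, 7, 1, 8, 2, 5 };
    int n = sizeof(arr) / sizeof(arr[0]);

    srand((unsigned)time(0));          /* seed the pivot picker once */
    RandomizedQuickSort(arr, 0, n - 1);

    for (int i = 0; i < n; i++)
        printf("%d ", arr[i]);         /* expected: 1 2 3 5 7 8 9 */
    printf("\n");

    return 0;
}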

The complete implementation is present here in the post, but it can also be found in my GitHub - https://github.com/harsha-kadekar/LibraryOfAlgorithms. The implementation is based on Chapter 7 - Quicksort of Introduction to Algorithms, 3rd Edition [Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein].