Dec 30 2008

Query for Google PageRank

Karthik Nadig

If you have used Google toolbar, you might have used the PageRank tool. It is a nifty tool, which shows PageRank of page or web site you are currently viewing. I was trying to figure out how it works. After googling for it and filtering out the sporadic bits of information, this is what I got.

Suppose you want to find the PageRank of some web site, lets just say Microsoft (www.Microsoft.com). You can either visit the web site and get page rank from the Google Toolbar or use web sites which are designed for this purpose like www.prchecker.info. In either case they generate a query which will look similar to this:

http://toolbarqueries.google.com/search?client=navclient-auto&ch=62624488910&ie=UTF-8&oe=UTF-8&features=Rank&q=info:http%3A%2F%2Fwww.microsoft.com

Go ahead click on the above link you’ll get some text in the browser window that appears like this:

Rank_1:1:9

The last number in the above string is the PageRank.

Most parts of the above query remain constant, the values that change are that of ch and q. The the value of q is very easy to generate, but the value of ch will need a bit of work. Let me explain the easy one first. Lets trim the fat and concentrate on the value of q.

info:http%3A%2F%2Fwww.microsoft.com

There are two parts to this value first is the "info:" part which is constant for any query, and the "http%3A%2F%2Fwww.microsoft.com" which is an escaped version of the original query that you requested( i.e., www.microsoft.com). "http://" has to be prefixed to the original query if it was excluded in the original string.

Any string an be easily converted to escaped string in almost all scripting languages. In Perl its uri_escape(), in .Net its Uri.EscapeDataString(), other scripting languages will have similar method to get the  escaped string.

These are the steps to generate the value of q:

  1. Check if http:// is prefixed in the input string, if not then include it (original_uri).
  2. Get the escaped version of the input string (escaped_uri).
  3. Value of q = "info:" + escaped_uri.

Now that the easy part is done lets move to the other one, ch. The value of ch in generated by hashing the input uri. So, the input to the hash function will have the following format, "http://….". This is a pretty simple algorithm, I’ll try to make it as generic as possible. I have broken the algorithm into 3 parts, Hash1, Hash2 and MainHash.

Hash1:

Input: Unsigned Integers a, b, c.

Output: Unsigned Integers a, b, c.

Algorithm:

   1:  a = a - b 
   2:  a = a - c
   3:  a = a xor (c right_shift 13)
   4:  b = b - c
   5:  b = b - a
   6:  b = b xor (a left_shift 8 )
   7:  c = c - a
   8:  c = c - b 
   9:  c = c xor (b right_shift 13)
  10:  a = a - b
  11:  a = a - c
  12:  a = a xor (c right_shift  12)
  13:  b = b - c
  14:  b = b - a
  15:  b = b xor (a left_shift  16)
  16:  c = c - a
  17:  c = c - b
  18:  c = c xor (b right_shift 5)
  19:  a = a - b
  20:  a = a - c
  21:  a = a xor (c right_shift 3)
  22:  b = b - c
  23:  b = b - a
  24:  b = b xor (a left_shift  10)
  25:  c = c - a
  26:  c = c - b
  27:  c = c xor (b right_shift 15)

Output: values of a, b, c

 

Hash2:

Input: String (character indexing should be allowed)

Output: Unsigned Integer

Algorithm:

   1:  Unsigned Integer a = 0x9e3779b9
   2:  Unsigned Integer b = 0x9e3779b9
   3:  Unsigned Integer c = 0xe6359a60
   4:  k = 0
   5:  len = Length of input_string
   6:  Unsigned Integers va,vb,vc
   7:   
   8:  Repeat until len >= 12
   9:      
  10:      va = input_string[k + 0]
  11:      va = va | (input_string[k + 1] left_shift 8 )
  12:      va = va | (input_string[k + 2] left_shift 16)
  13:      va = va | (input_string[k + 3] left_shift 24)
  14:      a = a + va
  15:   
  16:      vb = input_string[k + 4]
  17:      vb = vb | (input_string[k + 5] left_shift 8 )
  18:      vb = vb | (input_string[k + 6] left_shift 16)
  19:      vb = vb | (input_string[k + 7] left_shift 24)
  20:      b = b + vb
  21:   
  22:      vc = input_string[k + 8]
  23:      vc = vc | (input_string[k + 9] left_shift 8 )
  24:      vc = vc | (input_string[k + 10] left_shift 16)
  25:      vc = vc | (input_string[k + 11] left_shift 24)
  26:      c = c + vc
  27:   
  28:      a, b, c = Hash1 a, b, c
  29:      k = k + 12
  30:      len = len - 12
  31:   
  32:  end of loop
  33:   
  34:  c = c + Length of input_string
  35:   
  36:  if (len > 10) c = c + (input_string[k + 10] left_shift 24)
  37:  if (len > 9) c = c + (input_string[k + 9] left_shift 16)
  38:  if (len > 8 ) c = c + (input_string[k + 8] left_shift 8 )
  39:  if (len > 7) b = b + (input_string[k + 7] left_shift 24)
  40:  if (len > 6) b = b + (input_string[k + 6] left_shift 16)
  41:  if (len > 5) b = b + (input_string[k + 5] left_shift 8 )
  42:  if (len > 4) b = b + (input_string[k + 4])
  43:  if (len > 3) a = a + (input_string[k + 3] left_shift 24)
  44:  if (len > 2) a = a + (input_string[k + 2] left_shift 16)
  45:  if (len > 1) a = a + (input_string[k + 1] left_shift 8 )
  46:  if (len > 0) a = a + (input_string[k])
  47:   
  48:  a, b, c = Hash1 a, b, c

Output: Unsigned Integer c.

 

MainHash

Input: String (character indexing should be allowed)

Output: Unsigned Integer

Here that algorithm get a little bit complicated. So, I’ll explain it in bit more detail. First two steps are direct, others need explanation, so here they are:

  1. Unsigned Integer c = Hash2 input_string
  2. c = ((c mod 0x0d) & 7) | (( c / 7) << 2)
  3. Generate a mapped string using f(x, c). This is explained below. So, new_string = string generated from table.
  4. Unsigned Integer c = Hash2 new_string
  5. Final Hash value = Concatenate "6", string version of c.

To generate a table and calculate f(x, c) for where x ranges from 0 to 19.

f(x, c) = x * 9 + c

In the following table I am using the values generated for http://www.microsoft.com . c = 404068858.

x f(x, c) Hex Little-endian hex format String
0 404068858

0x181599fa

0xfa, 0x99, 0x15, 0x18

ú™

1 404068849

0x181599f1

0xf1, 0x99, 0x15, 0x18

ñ™

2 404068840

0x181599e8

0xe8, 0x99, 0x15, 0x18

è™

3 404068831

0x181599df

0xdf, 0x99, 0x15, 0x18

ߙ

4 404068822

0x181599d6

0xd6, 0x99, 0x15, 0x18

֙

5 404068813

0x181599cd

0xcd, 0x99, 0x15, 0x18

͙

6 404068804

0x181599c4

0xc4, 0x99, 0x15, 0x18

ę

7 404068795

0x181599bb

0xbb, 0x99, 0x15, 0x18

»™

8 404068786

0x181599b2

0xb2, 0x99, 0x15, 0x18

²™

9 404068777

0x181599a9

0xa9, 0x99, 0x15, 0x18

©™

10 404068768

0x181599a0

0xa0, 0x99, 0x15, 0x18

™

11 404068759

0x18159997

0x97, 0x99, 0x15, 0x18

—™

12 404068750

0x1815998e

0x8e, 0x99, 0x15, 0x18

Ž™

13 404068741

0x18159985

0x85, 0x99, 0x15, 0x18

™

14 404068732

0x1815997c

0x7c, 0x99, 0x15, 0x18

15 404068723

0x18159973

0x73, 0x99, 0x15, 0x18

16 404068714

0x1815996a

0x6a, 0x99, 0x15, 0x18

17 404068705

0x18159961

0x61, 0x99, 0x15, 0x18

18 404068696

0x18159958

0x58, 0x99, 0x15, 0x18

19 404068687

0x1815994f

0x4f, 0x99, 0x15, 0x18

Note: This is a sample table, you need to generate this table for each query.

To generate the table, first step will be to apply the value of "x" in the function f(x, c),  convert the result to 32 bit Hex values and reorder the bytes in little-endian format. Next step is to convert the individual Hex bytes to corresponding characters, which will give you a string that will appear similar to that shown in the last column. Finally you have to append the strings together.

After appending all of the strings together the string will appear as shown below.

ú™ñ™è™ß™Ö™….

Let me show what happens to the values in each step of five steps in this algorithm with http://www.microsoft.com as input. Remember this is the main search query not the input to this algorithm. Input to this Hash algorithm will come from another function, I’ll get to that later.

  1. Unsigned Integer c = Hash2 input_string. (result: c=707120502)
  2. c = ((c mod 0x0d) & 7) | (( c / 7) << 2).  (result: c=404068858)
  3. String generated by creating the table. (result: new_string=ú™ñ™è™ß™Ö™…. )
  4. Unsigned Integer c = Hash2 new_string. (result: c=2624488910)
  5. Final Hash value = Concatenate "6", string version of c. (result: c=62624488910)

That ends the MainHash algorithm. Now to bring it all together in a final step to generate the query string. This is the main part of the entire query building process.

  1. Check for "http://" in the input query string, if "http://" is missing then prefix it to the query string. Lets call it checked_query_string.
  2. Initialize base_query_string with constant string  "http://toolbarqueries.google.com/search?client=navclient-auto&ie=UTF-8&oe=UTF-8&features=Rank".
  3. info_query_string = concatenate "info:" and checked_query_string.
  4. Generate code string. code_string = concatenate "&ch=" and MainHash of info_query_string.
  5. Info string. info_string = concatenate "&q=info:" and Escaped version of checked_query_string .
  6. final_query_string = concatenate base_query_string, code_string and info_string.

Here is a sample result of the above algorithm:

http://toolbarqueries.google.com/search?client=navclient-auto&ie=UTF-8&oe=UTF-8&features=Rank&ch=62624488910&q=info:http%3A%2F%2Fwww.microsoft.com

The result of this query will be: Rank_1:1:9

The last number "9" is the PageRank.

Wow, you read through the whole thing, hats off to you.


Dec 13 2008

How digital creatures can learn binary math

Karthik Nadig

I was working on a project to find out the optimal network structure for a multi-layered perceptron, and I found that this technique could be very useful. All though, in this example I am keeping the concept simple, the network structure that I have used here is a fixed one, but the best network is selected using Genetic Algorithm. In this example I have created a artificial world where the creatures can survive to the next generation if they learn faster.

I have created a simple environment for the digital creatures where they can only survive if they can perform binary addition.  The purpose of this experiment was to create an environment where these digital creatures evolve and learn binary addition in the process. I am forcing them to learn binary addition, for quicker results.

Each digital creature is composed of 7 neurons, in a 2:3:2 layout. The network of neurons is of fully connected type. The DNA sequence of these creatures defines the connection strength between the various neurons. The first generation of digital creatures has connection weights selected by using random values. With each stage they provide the basic information required to survive with the environment to the next generation.

Each generation gains some experience by living within the environment for certain number of cycles. During this stage they are trained using the Back-Propagation algorithm. They are trained for only a certain number of cycles after which the elites in the population are chosen. Elites are those which have the best chance of survival, and are selected as members of the next generation. The remaining members for the next generation are created by paring the members of the current population.

The difference between the elite members and other members is that elite members have their DNA intact, where as other members will have modified versions of the DNA. The DNA is modified by a process called Crossover, where the paired members contribute a part of their DNA to the next generation. This new DNA may sometimes be further modified by another phenomenon – Mutation. In Mutation a randomly selected gene gets replaced by a new gene. Both, Crossover and Mutation may have either positive or negative effects on the creature which will be formed using the modified DNA.

The Algorithm:

The first step is to create a base population and allow it to train. Then check if the required state is reached, if not, then  move to next generation. Train the new generation and again check if the required state is reached. Repeat until the optimum solution is obtained. Flow diagram for the algorithm is shown below:

FlowDiagram

  1. Creating the First Generation: This step involves creating the Neural Network for each of the member for the first generation. The network is initialized with weights generated using Nguyen-Widrow algorithm. The input and output values are scaled so that they are normalized.
  2. Training: Each member of the population is run through a predefined number of training cycles. The ANNs are trained using the Back-Propagation Algorithm. The version of back-propagation algorithm used in the example has been slightly modified, so that the convergence is faster. Also, the training samples are shuffled before each training cycle.
  3.  Required State: To test the required state, a sample input is given and the output which is received will be compared with expected output. If the difference is lower than a predefined error threshold then the required state is reached and the process will be stopped. Otherwise, the process will be repeated until it is reached.
  4. Creating the Next Generation: To create the new generation we need to select the elites first and then we can create more members by randomly paring the members of the current population. Elites are those members who have the least error when a sample input is provided. The number of elites in a population is selected based on a ratio which will be around 10% to 30% of the total population. The remaining members for new generation are created by crossover paring. Here, certain traits from each of the member in a pair are selected and a new DNA is formed. This new DNA will be used to create a member for the next generation. During crossover Mutation can occur and occurrence of this is determined by a probability factor.

The Digital Creature:

As I had mentioned earlier, each digital creature has 7 neurons arranged in 3 layers. The neuron connection diagram is shown below (each of the circles represent a neuron). That’s all there is to it, each creature is a simple fully connected Multi-Layer Perceptrons(MLP).

ANNStructure

The DNA:

For this project I have used a mapped DNA, i.e., each of the weights that are marked in the neuron structure diagram are directly map to a gene on the DNA. The figure below will give you a better idea.

DNAStructure

Crossover and Mutation:

Consider two DNA strands, let’s call them DNA1 and DNA2. With the crossover process we’ll end up with a new DNA that has some genes for DNA1 and some from DNA2. Since in this project I have directly mapped neuron connection weights to genes, the new DNA contains connection weights of the contributing DNA strands. As for the mutation part, it’s just a random value, but the value is selected to be within the range of the original DNA strand.

NewDNA

Advanced form of the example:

In an advanced form of this example, instead of the direct mapping of the genes to the neuron connection weights, there would be a more complex DNA which had control over number of layers and number of neurons per layer also the type of transfer function for each layer of neurons and finally the weights. To make in more interesting instead of forcing them to learn binary addition, they would be allowed to develop it by reward strategy. Though this is interesting, the time it takes for them to develop binary math will be too long. So, for now this example should be enough to demonstrate the possibility.

Working with the project files:

The project was built using Visual Studio 2008. Just extract the contents of the ZIP file and run the project. The default values are preset, so just press "Run". The result will be displayed on the right side, which will show the number of generations it took to reach the prescribed error threshold, current error and also the result of binary addition using the elite member of the latest generation.

If only "Android 18" was so easy to build.