Google Page Rank Query

by John Walker


The order in which a Google search displays results, apart from “sponsored links” and promotion of Google's own offerings over those of others, depends upon the “PageRank™” of the item. While the basic algorithm is described in the 1998 paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Google founders Sergey Brin and Lawrence Page, the details of the exact algorithm used by Google are closely guarded and changed from time to time with the goal of defeating attempts to “game” the ranking algorithm. Despite persistent claims to the contrary, pigeons play no part in Google's ranking of pages.

Given how much of the traffic to many Web sites is directed there by search engines, and Google in particular, most site managers are keenly interested in the ranking of pages at their own site and how they compare with the competition. Google does not provide an option to display the PageRank of sites in search results, but a variety of browser plug-ins, query services, and meters one can embed within Web pages have been developed to display it. Many of these solutions have disadvantages: they may only work with certain browsers, disclose a user's search activity to a third party, or limit the number of requests and/or force the user to solve a so-called CAPTCHA puzzle for each query.

This page describes a utility, written in Perl, which allows you to display Google PageRank information in a variety of ways: from the command line, through a Web query form, and in meters embedded within Web pages.

Command Line Queries

The simplest way to request the Google page rank for a URL is to invoke the page_rank.pl utility from the command line (shell) prompt. Navigate to the directory where you extracted the utility and run it with the URL whose page rank you want as its argument. For example, here are several queries in succession (page ranks were those returned when this example was run on January 1st, 2007; they may have changed subsequently). The “$” character represents your system's command prompt; do not type it before the command.

    $ ./page_rank.pl http://www.google.com/
    10
    $ ./page_rank.pl http://www.gutenberg.org/
    8
    $ ./page_rank.pl http://www.spam.com/
    7

The page rank (a number between −1 [indicating no ranking is available] and 10 [the highest]) is simply written to standard output without any adornment. This makes it easy to query page rankings in scripts using the “back-tick” mechanism. Since command line requests are processed on your own computer and obtain the rankings directly from Google, there are no restrictions (apart from any imposed by Google) on the number of requests you can make or which pages you query.
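As a minimal sketch of the back-tick approach, the following shell function collects the ranks of several URLs in one pass. The function name rank_report and the PAGE_RANK_CMD variable are illustrative and not part of the distribution; the sketch assumes page_rank.pl is in the current directory unless PAGE_RANK_CMD says otherwise.

```shell
#!/bin/sh
# Hypothetical helper: print "rank<TAB>url" for each URL given.
# PAGE_RANK_CMD is an assumed override; it defaults to ./page_rank.pl.
rank_report() {
    for url in "$@"; do
        # Back-ticks capture the number page_rank.pl writes to standard output.
        rank=`"${PAGE_RANK_CMD:-./page_rank.pl}" "$url"`
        printf '%s\t%s\n' "$rank" "$url"
    done
}

# Example (each URL costs one request to Google):
#   rank_report http://www.google.com/ http://www.gutenberg.org/ | sort -rn
```

Because the rank is the first field on each line, the output can be piped straight into sort or similar tools.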

Web Query Form

If you manage a Web server and have the ability to install Perl CGI programs on it, you can implement a Web form like the following which will allow you and others you designate to obtain the page rank of arbitrary Web pages from any browser. Simply enter the URL for the page in the “Address” box and press the “Get Page Rank” button. If you install such a form on a public Web server, you may find it quickly discovered and deluged by a multitude of requests submitted by total strangers. To keep this from happening, requests are processed only if they specify an “API Key” which identifies the requester as authorised. This can be any sequence of characters you designate, but a user's E-mail address is an easy to remember choice.

In the following example, I've specified the Fourmilab site and an API Key of “chef@ratburger.org” (no, this is not a valid E-mail address!). To keep this form from being hijacked by third parties, the fields are read-only and this particular API Key can be used only to obtain the page rank of the www.fourmilab.ch site.

[Query form: read-only “Address” and “API Key” fields with a “Get Page Rank” button]

The result returned from a Web form inquiry embeds the API Key as a hidden form field. You can bookmark the result URL and use it to make subsequent requests without the need to enter your API Key each time.

Embedded Page Rank Meters

Invoking the page rank CGI program within an HTML image (<img>) element allows you to include a graphical page rank meter showing the current rank of the page or any other page you designate. For example, here are the rankings of four sites.

http://www.w3.org/ [PageRank meter]
http://www.cnn.com/ [PageRank meter]
http://www.fourmilab.ch/ [PageRank meter]
http://www.drudgereport.com/ [PageRank meter]

The rankings shown in the meters are “live”—they were retrieved from Google when your browser displayed this page; if the rankings should change, the meters will reflect this when the page is refreshed. If you have installed the page rank query program as “PageRank” in the cgi-bin directory of your Web server, you can include a meter which will show the page rank of the page which includes it with the following HTML:

    <a href="http://www.fourmilab.ch/webtools/PageRank/"><img
         src="/cgi-bin/PageRank?uri=referer"
         alt="Google PageRank of this page"
         width="80" height="15" border="0" /></a>

To display the rank of a different page, specify its URL in the “uri” argument:

    <a href="http://www.fourmilab.ch/webtools/PageRank/"><img
         src="/cgi-bin/PageRank?uri=http://www.southparkstudios.com/"
         alt="Google PageRank of South Park Studios"
         width="80" height="15" border="0" /></a>

In both of these examples I have supplied a “courtesy back-link” to this page so folks who are interested in installing page rank meters in their own pages can discover how by clicking the meter in your page. Such back-links are appreciated but not a requirement for using the page rank meter.

All possible values for the page rank meter and their correspondence with numerical page ranks are as follows.

−1  PageRank not available     3  PageRank 3      7  PageRank 7
 0  PageRank 0                 4  PageRank 4      8  PageRank 8
 1  PageRank 1                 5  PageRank 5      9  PageRank 9
 2  PageRank 2                 6  PageRank 6     10  PageRank 10

Downloading and Installation

Note: The Web-based page rank query is a Common Gateway Interface (CGI) program written in the Perl language. Installing a CGI program requires detailed knowledge of the Web server configuration of the system on which it is to be installed, and may require administrative (super-user) privilege to install. Since Web server configurations differ widely from system to system, there's no cookbook approach to installing a program such as this: you need to understand what you're doing, and know how to track down and fix problems based on error messages in the HTTP server error log.

To install the page rank query utility on your Web server, perform the following steps.

  1. Download the distribution archive:
    page_rank-1.1.tar.gz
  2. Uncompress and extract the archive into a new empty directory. This directory may be located anywhere on your Web host, but must be readable by the Web server application. It is best not to install the distribution directly into the server's CGI program directory.
  3. Edit the page_rank.pl program. The first line specifies the location of the Perl interpreter on your system; on Unix-like systems you can determine this with the shell command “which perl”. This location should be entered on the first line of page_rank.pl, following the “#! ” characters. Review the Detailed Configuration section below and configure the variables at the top of the program for the security options appropriate for your site.
  4. Save the modified page_rank.pl file.
  5. Test the page_rank.pl file by invoking it from the command line with a URL argument such as:
    ./page_rank.pl http://www.fourmilab.ch/
    If you encounter problems (bad path to the Perl interpreter, missing Perl modules, etc.) correct them. You will have to install the Perl modules LWP::UserAgent and URI::Escape if they are not present on your machine.
  6. Edit the PageRank shell script. Replace the path name on the cd command with the full path of the directory in which you installed page_rank.pl and the subdirectories from the distribution.
  7. Copy the PageRank script into your Web server's CGI binaries directory. Make sure the program has execute permission for all users (on Unix-like systems, you can use the command “chmod 755 PageRank” to set global execute permission).
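The mechanical parts of this procedure (steps 2, 6, and 7) can be sketched as a shell function. This is only an illustration under stated assumptions, not a supported installer: the name install_page_rank is made up, the archive is assumed to unpack page_rank.pl and PageRank at its top level, and the PageRank script is assumed to contain a line beginning with “cd ”. Steps 3 to 5 (setting the Perl path, configuring security options, and testing) still have to be done by hand.

```shell
#!/bin/sh
# Hypothetical sketch of steps 2, 6 and 7. The three arguments are values
# you supply: the downloaded archive, a new install directory, and your
# server's CGI directory. Paths containing "|" or "&" would confuse sed.
install_page_rank() {
    archive=$1
    install_dir=$2
    cgi_dir=$3

    # Step 2: extract into a new directory outside the CGI directory.
    mkdir -p "$install_dir" || return 1
    gzip -dc "$archive" | (cd "$install_dir" && tar xf -) || return 1
    install_dir=$(cd "$install_dir" && pwd) || return 1   # make the path absolute

    # Step 6: point the cd command in the PageRank script at install_dir,
    # writing the edited copy directly into the CGI directory.
    sed "s|^cd .*|cd $install_dir|" "$install_dir/PageRank" \
        > "$cgi_dir/PageRank" || return 1

    # Step 7: give the installed script execute permission for all users.
    chmod 755 "$cgi_dir/PageRank"
}

# Example (directories are placeholders for your own system):
#   install_page_rank page_rank-1.1.tar.gz /opt/page_rank /var/www/cgi-bin
```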

Creating Custom Web Query Forms

While you can obtain a blank page rank query form by directly invoking the CGI program with no arguments, you may wish to embed a query form in a page on your site, optionally specifying custom arguments for the request. A basic request form may be added with the following HTML code:

    <form method="get" action="/cgi-bin/PageRank" target="_blank">
    <p>
        <input type="hidden" name="html" value="1" />
        Address:&nbsp;<input type="text" name="uri" size="60"
            maxlength="1024" value="" />
    </p>
    <p>
        API&nbsp;Key:&nbsp;<input type="text" name="apikey"
            size="24" value="" />
    </p>
    <p class="c">
        <input type="submit" value=" Get Page Rank " />
        <input type="reset" value=" Reset " />
    </p>
    </form>

If your server uses a different CGI program directory, adjust the “action” attribute in the form tag accordingly. You may change the “method” for the form to post if you wish; this will keep the API Key from being displayed in the browser's URL bar, but make it more difficult for users to bookmark queries. You're free to change the label (“value”) on the submit button to whatever you wish, or use a graphical button instead of a text button.

Customising Form Requests

You can customise the behaviour of PageRank by including the following hidden input fields within the form which invokes the script.

<input type="hidden" name="debug" value="1" />
The CGI program's environment variables will be displayed in the reply document. This allows seeing how the server passed arguments to the program and how they were parsed internally, and can be useful when diagnosing configuration problems.
<input type="hidden" name="html" value="1" />
The result will be returned as an HTML document (XHTML 1.0 Strict DTD to be precise). The result document contains an address field and request button which allows you to make subsequent requests using the same API Key supplied in the initial request.
<input type="hidden" name="text" value="1" />
The page rank will be returned as a simple number (−1 to 10) in a document of “text/plain” type. You can use this specification when obtaining page ranks within a script which invokes a command line Web client such as Wget or cURL. If neither html nor text mode is specified, the result will be an “image/png” document containing the graphical page rank meter.
<input type="hidden" name="apikey" value="keytext" />
This argument allows you to pre-specify the API Key for a request. Do not use this specification in a page available to the public, as it will defeat the protection the API Key is intended to provide! You may wish to use this specification in a request form stored on your local computer and accessed with a file: URL to avoid the need to enter your API key for each request.
<input type="hidden" name="uri" value="URI" />
This specification supplies the URI/URL (the Web address of the page whose rank you wish, for example http://www.fourmilab.ch/) instead of allowing the user to enter the address in a text field. If set to the special value “referer”, the URL of the page containing the request form will be used automatically, which permits you to include page rank meters on a collection of pages without the need to specify the URL for each.
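The same fields can be passed in a query string when calling the CGI program from a script. The sketch below builds a text-mode request URL from the “text”, “uri”, and “apikey” fields described above. The helper names page_rank_query_url and urlencode are illustrative, as is the server address in the example. Values are percent-encoded much as a browser would encode them when submitting the form with the get method; the encoder handles ASCII input only.

```shell
#!/bin/sh
# Hypothetical helper: build a text-mode query URL for the PageRank CGI.
# $1 = address of your CGI program (adjust to your server),
# $2 = page whose rank you want, $3 = API Key.
page_rank_query_url() {
    printf '%s?text=1&uri=%s&apikey=%s\n' \
        "$1" "$(urlencode "$2")" "$(urlencode "$3")"
}

# Percent-encode everything except letters, digits, and . _ ~ -
# (the RFC 3986 unreserved characters). ASCII input only.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # string without its first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Example with cURL (server name and key are placeholders):
#   curl -s "$(page_rank_query_url http://www.example.org/cgi-bin/PageRank \
#       http://www.fourmilab.ch/ 'your-api-key')"
```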

Detailed Configuration

You can customise the configuration of page_rank.pl in various ways by changing variable definitions at the start of the script.

$restrictReferer
If $restrictReferer is non-null, it is used as a regular expression to test the HTTP_REFERER field received from the Web server. This can (and should) be used to restrict access to pages originating at the site where it's hosted. If your site has several aliases (ratburger.org, ratburger.net, etc.) or permits access by IP address or IP address range, you'll have to craft a regular expression which matches everything you may see as a referer. Here, for example, is the declaration used for the Fourmilab server:
    my $restrictReferer = qr{^http://((((www|server\d)\.)? fourmilab\. (ch|com|net|org|to))| 193\.8\.230\.\d+)/}x;
Please note that while $restrictReferer provides protection against naïve users who hijack your server's PageRank application by embedding references to it in their own Web pages, your server is still subject to abuse by clever freeloaders who use programs like Wget and cURL or Web programming toolkits which allow forging the referring page specification. To guard against such pocket-picking, use a $restrictURI (described below) to limit queries to pages on your own site or those referenced in your own pages.
$restrictURI
If $restrictURI is non-null, it is used as a regular expression to test the URI/URL for which the query is being made. Only if it matches is the request processed; otherwise the query is aborted without any indication of the reason to the (ab)user. (The site manager is alerted to the incident by a message in the HTTP server error log.) To restrict queries to pages on the fourmilab.ch site, one might use:
    my $restrictURI = qr{^http://(www\.)?fourmilab\.ch/};
%API_keys
This hash is initialised to one or more key/value pairs where the key(s) are the API Keys accepted by Web form queries and the values can be anything (I just set them to the number 1). API Keys can be any string value you can enter in an HTML text field. If you're granting access to your Web query form to individual visitors to your site, you might consider using their E-mail addresses as API keys; that will be easy for the visitors to remember and identify them when you're reviewing the list. If your own E-mail address is well known and associated with the site on which the form is published, it's unwise to use it as your own API key since it's easy for an interloper to guess. Instead, it's wiser to choose a difficult to guess sequence as you'd use for a password; you can always bookmark a result page with the API key embedded to avoid having to enter it every time. Here is a specification which grants access to three users, one with a password-like key (generated by HotBits) and two via their E-mail addresses.
    my %API_keys = ( 'hXGZ*.+GqI%y', 1, 'distims@gostak.org', 1, 'ernst@blohard.com', 1 );
If you do not specify any API Keys you will not be able to make requests via the Web form interface.
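Before installing a restriction pattern, it can help to try it against a few URLs you expect to be accepted and rejected. The following is a rough sketch using grep; the function name uri_allowed is illustrative. Note that grep -E uses POSIX extended regular expressions rather than Perl's, so this is only a sanity check for simple patterns such as the fourmilab.ch example above, not a substitute for testing the installed program.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed if URL $2 matches pattern $1.
# grep -E is not Perl; constructs such as \d or the /x modifier differ.
uri_allowed() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# The simple site restriction from the example above.
pattern='^http://(www\.)?fourmilab\.ch/'

# Example:
#   uri_allowed "$pattern" http://www.fourmilab.ch/webtools/ && echo accepted
#   uri_allowed "$pattern" http://www.example.com/ || echo rejected
```

The leading “^” anchor matters: without it, a URL which merely contains the permitted address somewhere in its query string would also be accepted.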

Acknowledgements

This program uses Yuri Karaban's WWW::Google::PageRank Perl module to obtain page rankings from Google's server. That module, in turn, uses the LWP::UserAgent and URI::Escape modules, both developed by Gisle Aas.

Google and PageRank are trademarks of Google Inc.


This software is in the public domain. Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, without any conditions or restrictions. This software is provided “as is” without express or implied warranty.
