EGGSHELL
Global Consciousness Project
Database Analysis Toolkit

EGGSHELL is a suite of programs which provide access to the data set assembled by the Global Consciousness Project, facilitating development of custom software for exploration and "data mining" of this vast and rapidly growing resource. Since innumerable questions can be posed regarding the data set, the programs, while providing commonly requested analyses, focus on efficient access to and manipulation of the data, minimising the amount of custom programming required for exploratory studies. In addition, the efficiency of optimised native-mode programs permits using them as the basis for data access tools and presentation software made available to the public on a Web server. The name "EGGSHELL" denotes programs, mostly run from the UNIX shell, providing access to the data collected by the worldwide "eggs" of the Global Consciousness Project network, as well as a "wrapper" for the raw egg data files, mediating access by analysis software and handling details such as exclusion of known-bad data and calculation of common statistical measures.

All of the programs comprising EGGSHELL are written in the C++ programming language and use the Standard Template Library (STL). In order to use this software, you'll need a recent, standards-compliant C++ compiler and library; all of these programs were developed using the GNU C++ compiler (g++) version 4.0.2 on an Intel Linux system running kernel 2.6.15-1.

The programs are written in the Literate Programming paradigm using Donald Knuth's CWEB programming language. As such, they are meant to be read as well as run; the programs serve as their own documentation and are intended to be entertaining and informative as well as efficient. Consequently, this document contains only an overview of the programs and pointers so you can read them on your own. CWEB programs automatically emit ready-to-compile C++ source code and documentation in the TeX documentation language. The tools for extracting code and documentation from CWEB programs are included in the EGGSHELL distribution, but you needn't use them nor learn the CWEB language yourself to employ the toolkit in your own programs; you're perfectly free to write standard C++ and link to the extracted C++ code, using the TeX files purely as a manual.

The links from the names of the programs below are to Adobe Acrobat PDF files produced from the TeX documentation. If your browser has a PDF plug-in, they will open directly in a new browser window. If your browser lacks the requisite plug-in, it will invite you to download the PDF file to a directory on your computer, whence you may read it with Acrobat Reader, a free download from Adobe available for virtually all popular platforms. The PDF files for the documents are included in the source distribution, which you may download from the bottom of this page; if you're planning to read the programs with a stand-alone copy of Acrobat Reader, it's probably easier to get the distribution, complete with all of the PDF files, than to download them one by one from the individual links.

Example Programs

examples
Eight example programs are included in the CWEB program examples.w. (One advantage of CWEB is that a single self-documenting program can emit as many C/C++ source and header files as necessary while remaining organised in a logical fashion for the reader, free of the dictates of the compiler.) The examples illustrate various analyses using the toolkit facilities, beginning with simple cases which introduce the toolkit and progressing to more complicated calculations which explore its various data extraction and reduction components. Reading this program is the best way to "get into" the toolkit, exploring the other programs it uses as you encounter them.
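To give a feel for the shape of a program which uses the toolkit, here is a minimal sketch of a stand-alone C++ program linked against the extracted code. Only the eggdatabases class and its set_local_defaults() method are taken from the toolkit as described below; the header file name and everything else in the sketch are illustrative assumptions, so consult examples.w for genuine, working examples.

    //  Minimal sketch only: "eggdata.h" is the assumed name of the
    //  header extracted from eggdata.w, and the body of main() is
    //  purely illustrative.  See examples.w for the real examples.

    #include <iostream>

    #include "eggdata.h"

    int main()
    {
        eggdatabases ed;        //  Directories containing the egg databases

        //  Use the paths configured for this host; see "Configuring
        //  Local Database Paths" below.
        ed.set_local_defaults();

        //  ...extract an eggsummary data set and analyse it here,
        //     as example-1 through example-8 do...

        std::cout << "Toolkit initialised." << std::endl;
        return 0;
    }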

Toolkit Component Programs

The toolkit facilities used by the example programs are contained in the following programs, discussed in roughly decreasing order of abstraction from the raw data. The analysis and eggdata programs are the heart of the toolkit, while the remaining programs provide utilities which can be used either in isolation or in conjunction with the egg database access facilities.

analysis
This program implements higher-level analyses of egg data sets, either the original one-day summaries or arbitrary time spans extracted from the collection of daily summaries.
eggdata
Provides tools for reading both egg data tables and auxiliary databases, such as the properties of individual eggs, known bad data which should be excluded from analyses, etc., with facilities for extracting and assembling data sets for analysis.
statlib
General purpose statistical library, which includes both tools for computing various probability distributions as well as descriptive statistics on data tables with a user-defined type.
timedate
Facilities for working with times and dates, using UNIX time_t quantities as the underlying type. Conversion to and from string representations, Julian day numbers, and computation of astronomical quantities such as sidereal time, the phase of the Moon, and the position of the Sun and Moon are provided. (A sketch of the Julian day conversion appears after this list.)
fourier
Tools for computing Fourier and inverse Fourier transforms of data sets and determining power spectra from frequency domain information. Note: I am not an expert in this domain, and anybody using this code would be well-advised to look closely at its implementation and test it with known data before relying on its results.
colour
Tools for manipulating colours and colour spaces, both physical and perceptual. This program isn't currently used by any of the examples; it was implemented for eventual use in stand-alone graphics generation (eliminating the need for GNUPLOT post-processing), but may prove useful in "artistic" presentations of the data set.
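As a small illustration of the kind of conversion the timedate component performs, the following self-contained fragment computes the Julian day number corresponding to a UNIX time_t. It is an independent sketch of the underlying arithmetic, not the toolkit's interface; the function name is hypothetical, and the actual types and calls should be taken from timedate.w.

    //  Independent sketch: convert a UNIX time_t to a Julian day
    //  number.  The UNIX epoch (1970-01-01 00:00 UTC) is Julian day
    //  2440587.5, and a day contains 86400 seconds.  The function
    //  name is hypothetical and is not the toolkit's interface.

    #include <ctime>
    #include <iostream>

    static double julian_day(std::time_t t)
    {
        return (t / 86400.0) + 2440587.5;
    }

    int main()
    {
        std::time_t now = std::time(0);
        std::cout.precision(5);
        std::cout << std::fixed << "Julian day: " << julian_day(now)
                  << std::endl;
        return 0;
    }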

Downloading and Installation

Download analysis.tar.gz (2.4 Mb)

The toolkit is supplied as a GZIPped TAR archive which extracts into the current directory. The CWEB programs are supplied as .w files, with the C++ (.c and .h) and TeX (.tex) files already extracted. Pre-generated PDF files for the documents are also provided.

To build the toolkit and example programs, you'll need a current C++ compiler and library (I used GCC/G++ 4.0.2 to develop them). After extracting the archive, build it with:

./configure
make

If all goes well, when this process is complete you'll have compiled object files for all of the toolkit components and ready-to-run executables for the example programs, example-1 through example-8. To run the examples, you'll need a copy of the Global Consciousness Project "eggsummary" and pseudorandom mirror files on your machine (or at least the ones used in the examples). The configure process automatically detects the locations of these files for the www.fourmilab.ch and noosphere.princeton.edu sites, but for other sites you'll need to add definitions of the database locations to the eggdatabases class definition in eggdata.w (see the set_local_defaults method and those it calls). (Unfortunately, the current noosphere.princeton.edu server has neither g++ nor TeX installed, so it isn't possible to build or use these programs there.) When you modify a .w file, the Makefile automatically rebuilds the C++ programs and TeX documents it defines; the CWEB tools which accomplish this are included in the distribution.

If you have a complete TeX distribution loaded, you can rebuild the document for a program prog and view it in the TeX previewer with the command:

make prog.view

and update the PDF documents with:

make doc

Configuring Local Database Paths

If you're installing the analysis toolkit on a machine onto which you've copied all or part of the CSV format "eggsummary" files, you'll need to configure the name of the directory in which the files are kept. To permit the software to be installed on various analysts' machines, all of the examples initialise their eggdatabases object by calling its:

    set_local_defaults();

method. The "./configure" script uses the "hostname" utility to obtain the name of the machine it's running on, and embeds this in the Makefile and configuration as a definition of the C macro HOSTNAME. When this macro is defined, eggdata.w defines the set_local_defaults() method, which tests for known hosts and sets the path names appropriately. To define a new host, add a new set_hostname_defaults() method to the eggdatabases class definition in eggdata.w, using the existing set_Fourmilab_defaults() and set_noosphere_defaults() methods as models, then add a case for the host name to the set_local_defaults() method immediately below it in the file. After you've tested the definitions for your host, please send me a copy of the code you added so I can incorporate it into eggdata.w in the next release. That way you won't have to keep modifying the file every time a new release is posted.
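For concreteness, the additions for a hypothetical host named "aquila" might look something like the following. The host name, the directory paths, and the exact form of the hostname comparison are assumptions; pattern the real code on the existing definitions in eggdata.w.

    //  Hypothetical sketch only: "aquila" and the paths are invented,
    //  and the real structure should be copied from eggdata.w.

    //  New method added to the eggdatabases class definition, modelled
    //  on set_Fourmilab_defaults():
    void set_aquila_defaults(void) {
        add_database("gcp", "/home/gcp/data/eggsummary");
        add_database("pseudo", "/home/gcp/data/pseudoeggsummary");
    }

    //  Case added to the existing set_local_defaults() method (strcmp
    //  is declared in <cstring>; the existing tests show the form
    //  actually used):
    if (strcmp(HOSTNAME, "aquila") == 0) {
        set_aquila_defaults();
    }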

If you don't want to add definitions for your host to eggdata.w, you can manually initialise the eggdatabases object with code like the following:

    eggdatabases ed;
    ed.add_database("gcp", "/home/httpd/html/data/eggsummary");
    ed.add_database("pseudo", "/home/httpd/html/data/pseudoeggsummary");

where the call with the argument of "gcp" specifies the path name of the directory containing the eggsummary files for the data taken by the egg network, and the call with "pseudo" the path for the pseudorandom mirror data generated by the GCP host. If you're only using one of the databases in your analyses, you needn't provide a path for the other. The eggsummary files in the directories may be compressed with GZIP.

Maintaining Data Integrity

The analysis software relies on two Comma-Separated Value (CSV) databases supplied with it which identify known bad data in the data set and specify physical properties of "egg" hosts in the network. These files were current as of the date the archive was posted, but it's up to you to verify that they're correct if you're analysing data collected subsequently. The files are as follows:

eggs.csv
Properties of "egg" sites in the Global Consciousness Project network. This is a machine-readable encoding of the table posted on the GCP site.
rotten_egg.csv
Known bad data in the GCP database, identified by egg number and the time range during which the data were unreliable. The source document is the table of errors on the GCP Web site; you should confirm that no additions have been made to this table before performing definitive analyses. (Note that data collected in the past may be subsequently discovered to be erroneous, so even data taken prior to the date of the rotten_egg.csv file in the distribution may be retracted if discovered bad.) We don't delete bad data from the data set in order to preserve its integrity and visibility; better to require analysts to exclude bad data than risk allegations of "cooking the books" by arbitrarily excluding data.

These files are not automatically generated nor frequently updated to reflect changes in their source documents. If you use these files, it is your responsibility to integrate any changes posted on the Global Consciousness Project site subsequent to their compilation.
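If you wish to inspect these files outside the toolkit, for example to compare your copy of rotten_egg.csv against the table on the GCP Web site, a generic CSV reader along the following lines suffices. The interpretation of the fields (egg number, then the start and end of the interval of bad data) is inferred from the description above; verify it against the file itself and against the parsing code in eggdata.w before relying on it.

    //  Generic sketch of reading rotten_egg.csv.  The assumed field
    //  layout (egg number, start of bad interval, end of bad interval)
    //  should be checked against the file and against eggdata.w.

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main()
    {
        std::ifstream in("rotten_egg.csv");
        if (!in) {
            std::cerr << "Cannot open rotten_egg.csv" << std::endl;
            return 1;
        }

        std::string line;
        while (std::getline(in, line)) {
            if (line.empty()) {
                continue;                   //  Skip blank lines
            }
            std::vector<std::string> fields;
            std::istringstream ls(line);
            std::string field;
            while (std::getline(ls, field, ',')) {
                fields.push_back(field);
            }
            if (fields.size() >= 3) {
                std::cout << "Egg " << fields[0] << " bad from "
                          << fields[1] << " to " << fields[2]
                          << std::endl;
            }
        }
        return 0;
    }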


by John Walker
July, 2006