Dev-Team Info for the ArcSpread Project

Still lots to add, and probably to correct. Work in progress.


I had more hints than the list below; I can't find the rest right now. I'll add them as I remember.

Useful Random Hints


For building and testing we use Maven, a Java-centric build tool.

Even as you do your first little test coding, go ahead and do it in a Maven-sanctioned code tree structure. You get that structure created automatically with the command:

mvn archetype:generate -Dfilter=quickstart \
    -DgroupId=edu.stanford.arcspread.mypackage \
    -DartifactId=MyProject
This will generate a tree whose path leads down to where your code then goes:
your code goes into the mypackage directory at the bottom of src/main/java. The process will have put a starter source file in that directory.
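A sketch of the tree the quickstart archetype produces, assuming the groupId/artifactId from the command above (App.java is the starter file name the quickstart archetype uses):

```shell
# Recreate the skeleton the quickstart archetype generates, to show where
# things land (MyProject and the package path come from the command above):
mkdir -p MyProject/src/main/java/edu/stanford/arcspread/mypackage
mkdir -p MyProject/src/test/java/edu/stanford/arcspread/mypackage
touch MyProject/pom.xml                                                   # build descriptor
touch MyProject/src/main/java/edu/stanford/arcspread/mypackage/App.java   # starter class
```

Your own code then goes next to (or replaces) App.java under src/main/java.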

Each Maven project is known and unique throughout the world via its coordinates: groupId, artifactId, and packageName. For example, the PhotoSpread Maven project has the coordinates PhotoSpread, PhotoSpread, and edu.stanford.photoSpread. The '-D' flags pass arguments into a Java program; Maven's command 'mvn' is such a Java program.

Let's make our groupId be ArcSpread. Your individual projects will each have a different artifactId, which you can invent; for example: wordBrowser. Let's have all our packages start with edu.stanford.[yourArtifactName].

Once you have issued the above command for your artifact, you'll have a pom.xml file in the root of your new tree. That's where any dependencies on outside libraries are recorded. If you put code into [rootdir]/src/main/java/..., which has been created for you, you just cd to your [rootdir] and run:

mvn compile

For actions other than compile (such as 'compile your test code,' 'run the tests,' or 'package your code into a jar file that I can run on my machine'), look for the keyword 'lifecycle phases' in the Maven literature.

Some other useful Maven commands:
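For instance, these are standard everyday Maven commands (a non-exhaustive sketch; the cheat-sheet file name is my own invention, and each command is issued in the directory holding pom.xml):

```shell
# Write a small cheat sheet of everyday Maven commands to a file, so this
# sketch runs anywhere, even on machines without Maven installed:
cat > maven_cheatsheet.txt <<'EOF'
mvn clean      # delete the target/ build-output directory
mvn test       # compile main and test code, then run the unit tests
mvn package    # compile, test, and bundle the code into target/<artifact>.jar
mvn install    # package, then copy the jar into the local ~/.m2 repository
EOF
cat maven_cheatsheet.txt
```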


Git has a huge command set; I only use a handful of its commands. You can find your own style of working with Git, but when I develop on my own, I feel safest just keeping a straight line of branches that I name by the dates I worked on them. Like this: assume it's Oct 12, 2011, and I have a branch called Sep25_2011, which is my currently checked out (i.e. active) branch. I start the day doing this:
git branch Oct12_2011
git checkout Oct12_2011
Now I change the code. When I'm done for the day, I do:
git push origin Oct12_2011:Oct12_2011
This will create a new branch in the remote repo, with the same name as the local branch I created in the morning. It's an extremely conservative use of Git, but it works for me. Feel free to be more adventurous, creating parallel branches, and merging them.
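You can rehearse this daily-branch routine safely in a throwaway repository; note that `git checkout -b X` collapses the `git branch X` / `git checkout X` pair into one command. The repo location, commits, and file below are made up for the demo:

```shell
# Rehearse the daily-branch workflow in a scratch repository:
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "seed"
# Morning: create today's branch off the current branch and switch to it.
# 'git checkout -b X' is shorthand for 'git branch X' + 'git checkout X'.
git checkout -q -b Oct12_2011
echo "work" > notes.txt
git add notes.txt
git -c user.email=dev@example.com -c user.name=dev commit -q -m "day's work"
git rev-parse --abbrev-ref HEAD    # -> Oct12_2011
# At day's end you would run: git push origin Oct12_2011:Oct12_2011
# (omitted here because this scratch repo has no remote).
```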

After some research I decided on diffuse as my file-diff viewing and merging tool. To make Git use diffuse for the 'git mergetool' command (after you have installed diffuse):
git config --global merge.tool diffuse

Git commands beyond the basics

The following commands are mostly from the Web. I did try them, but there could be typos. Try anything new with Git on an example repository first. Git can bite.

Github is our Code Repository

We'll use two facilities on Github: several code repositories, and one repository for intra-project info. Feel free to add pages and links to this index.html file. Current repos on Github are:

Machine Room Info

Structuring Interaction Between Sheet Engine and Machine Components

In Github repo PigIRAnt, in directory PigScripts/CommandLineUtils you'll find how I envision the machine room facilities to work. Each processing module is made up of two files, plus associated Java User-Defined Functions (in src): One PigScript that does the processing, and one Bash shell script that serves as a console command that invokes the Pig script. Each shell script provides usage info when invoked with -h, --help, or no parameters.
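The console-command pattern can be sketched like this. The script name, Pig-script name, and parameters are all made up; only the -h/--help/no-arguments usage convention mirrors what the real scripts do:

```shell
# Hypothetical console command wrapping a Pig script. It follows the
# convention described above: -h, --help, or no arguments prints usage;
# otherwise the arguments are handed to the processing Pig script.
usage() { echo "Usage: wordCount.sh <crawlName> [numPages]"; }

word_count() {
    if [ $# -eq 0 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
        usage
        return 0
    fi
    # Hand the arguments to the (hypothetical) Pig script:
    pig -param CRAWL_NAME="$1" -param NUM_PAGES="${2:-10}" wordCount.pig
}

word_count --help    # prints: Usage: wordCount.sh <crawlName> [numPages]
```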

Each Pig script uses the WebBase loader or the WARC loader to pull in Web pages. The scripts' outputs are usually files in HDFS that can be consumed directly by the upper layers, or can be moved into SQLite.

The spreadsheet engine will invoke the shell scripts as OS calls from Java.

Our Cluster

We have a roughly 60-node cluster in the basement of the Gates building. The main machine among those is ilc0 (for info-lab-zero). You'll need an account on that machine; that's where you do your full tests. The machine has an HDFS storage section and a regular home-directory section (/home/[userName]). You put your Pig scripts and corresponding shell scripts into the home-directory section; results will show up in HDFS.

Using Hadoop

You interact with the HDFS file system from a shell command line on ilc0. Here are some useful aliases to put into your .bashrc (or other shell startup file), plus various URLs and files where you can monitor what Hadoop is doing:
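As a starting point, aliases along these lines are what I mean. The short names are my own inventions; the underlying `hadoop fs` subcommands are standard:

```shell
# Candidate .bashrc aliases for poking at HDFS from the ilc0 command line.
# Alias names are invented; the 'hadoop fs' commands they wrap are standard.
alias hls='hadoop fs -ls'      # list an HDFS directory
alias hcat='hadoop fs -cat'    # print an HDFS file to stdout
alias hget='hadoop fs -get'    # copy an HDFS file to the local file system
alias hput='hadoop fs -put'    # copy a local file into HDFS
alias hrm='hadoop fs -rm'      # delete an HDFS file
```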

Using Pig

Random Pig hints:

Interacting With WebBase Via a Browser

To see which crawls are available, and which sites each crawl covered, you can interact with WebBase directly via the Web; this route is independent of Hadoop and the Pig loader.

Go to the WebBase home page. Once there, find the paragraph on Wibbi, and click on the link there.

You'll find a page that lets you define a stream of pages from one crawl. On the first page you specify how many pages you want, and how you want them filtered. On the next page you'll specify which crawl you want.

On that crawl selection page you'll see the crawl names in the first column. That's the name the Pig WebBase loader needs to find the crawl.

When you hit the download button in one of the rows, your browser will ask you where you want the impending stream to be stored. The file you specify there will hold all the pages you download.

Sample Hadoop-Created Crawl Info Files

For reference I created two datasets as examples of what will come out of the Hadoop processing step. One is a CSV wordcount file for (part of) the March 2007 government crawl. The second is a part-of-speech-tagged CSV file for the first 1000 pages of the June 2007 general crawl. You find them at

These are good examples: the wordcount file has simple strings or numbers in its comma-separated columns, but the POS file has little two-tuples. HongXia's DB library will hide this difference.
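Purely invented rows, just to illustrate the structural difference (the real files' exact columns and values may differ):

```shell
# Invented sample rows showing the two shapes (NOT the real data).
# Wordcount: plain string/number columns.
cat > wordcount_sample.csv <<'EOF'
the,10432
government,2210
budget,873
EOF
# POS-tagged: each column is a (word,tag) two-tuple.
cat > pos_sample.csv <<'EOF'
(the,DT),(government,NN),(announced,VBD)
(a,DT),(new,JJ),(budget,NN)
EOF
```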

Useful Software

Page Processing Utilities

Several classes serve both as modules within Hadoop jobs and for use within your Java applications. These classes are collected in the Git repository PigIR.

The following utilities are currently available; access these facilities via the four classes in edu.stanford.pigir.arcspread. Each of them includes a main() method with an example.

Accessing the Page Processing Utilities From Your Code

You most easily use this code by splicing a repository entry for the 'Stanford Infolab Maven Repository' into your pom.xml:
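Expanded into a well-formed repositories entry, it would look roughly like this. Only the repository name is given here; the id and URL values are placeholders you need to fill in:

```xml
<!-- Sketch of the pom.xml repositories entry. The id and url values
     are placeholders: substitute the real Stanford Infolab values. -->
<repositories>
  <repository>
    <id>REPOSITORY-ID-GOES-HERE</id>
    <name>Stanford Infolab Maven Repository</name>
    <url>REPOSITORY-URL-GOES-HERE</url>
  </repository>
</repositories>
```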

This should resolve your access to these utilities both on the command line (mvn compile), and within Eclipse, if you import your code as a Maven project.

Then, in your code, import what you need. For example, to use the part-of-speech tagger:
import edu.stanford.pigir.arcspread.POSTagger;

For the WebBase page extraction utility:
import edu.stanford.pigir.webbase.DistributorContact;
import edu.stanford.pigir.webbase.WbRecord;
import edu.stanford.pigir.webbase.wbpull.webStream.BufferedWebStreamIterator;

Distributing your Code

If your code uses the PigIR utilities, then your users will need access to PigIR as well. The easiest way to provide it is to splice a corresponding entry into your pom.xml dependencies section.
Using the part-of-speech tagger directly is no longer recommended; it is deprecated in favor of the POSTagger utility above.
  • If you want to use the Stanford Part-Of-Speech (POS) tagger anyway, add the repository information for the 'Stanford Infolab Maven Repository-releases' into your pom.xml file (if you already have a repositories or dependencies entry in your pom.xml, splice the new entries into those existing elements).


    This will automatically download the jar file, and adjust your Java path to find entries within it.
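Whichever utilities you pull in, the dependency itself is spliced into pom.xml like this. This is only a sketch: all three coordinate values are placeholders, since the real PigIR coordinates are not listed here:

```xml
<!-- Sketch of the pom.xml dependencies entry for PigIR. All three
     coordinates are placeholders: substitute the real values. -->
<dependencies>
  <dependency>
    <groupId>PIGIR-GROUP-ID</groupId>
    <artifactId>PIGIR-ARTIFACT-ID</artifactId>
    <version>PIGIR-VERSION</version>
  </dependency>
</dependencies>
```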