Dev-Team Info for the ArcSpread Project

Still lots to add, and probably to correct. Work in progress.

Conventions

I had more conventions than the list below; I can't find them right now. I'll add them as I remember.

Useful Random Hints

Maven

For building and testing we use Maven, a Java-centric build tool.

Even as you do your first little test coding, go ahead and do it in a Maven-sanctioned code tree structure. You get that structure created automatically with the command:

mvn archetype:generate -Dfilter=quickstart \
    -DgroupId=edu.stanford.arcspread.mypackage \
    -DartifactId=MyProject \
    -DpackageName=mycode
This will generate a tree with the following path to where your code then goes:
MyProject/src/main/java/edu/stanford/arcspread/mypackage
Your code goes into mypackage. The process will have put a file called App.java in that directory.

Each Maven project is known and unique throughout the world via its coordinates: groupId, artifactId, and packageName. For example, the PhotoSpread Maven project has the coordinates PhotoSpread, PhotoSpread, edu.stanford.photoSpread. The '-D' flags pass arguments into a Java program; Maven's command 'mvn' is such a Java program.

Let's make our groupId be ArcSpread. Your individual projects will each have a different artifactId, which you can invent; for example: wordBrowser. Let's have all our packages start with edu.stanford.[yourArtifactName].

Once you have issued the above command for your artifact, you'll have a pom.xml file in the root of your new tree. That's where any dependencies on outside libraries are recorded. If you put your code into [rootdir]/src/main/java/..., which has been created for you, you just cd to your [rootdir] and run:

mvn compile
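
Putting those conventions together, the sequence for a new project might look like this (wordBrowser is just the sample artifactId from above, and I'm assuming the archetype flags behave as in the quickstart command shown earlier; substitute your own names):

mvn archetype:generate -Dfilter=quickstart \
    -DgroupId=ArcSpread \
    -DartifactId=wordBrowser \
    -DpackageName=edu.stanford.wordBrowser
cd wordBrowser
mvn compile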

For actions other than compile (compiling your test code, running the tests, packaging your code into a jar file that I can run on my machine, etc.), look for the keyword 'lifecycle phases' in the Maven literature.

Some other useful Maven commands:
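
A few standard ones, to give the flavor (these are stock Maven lifecycle targets, not an ArcSpread-specific list; run them from the directory that holds pom.xml):

mvn clean          # wipe the target/ directory
mvn test           # compile main and test code, then run the unit tests
mvn package        # compile, test, and bundle everything into target/*.jar
mvn install        # like package, plus copy the jar into your local Maven repository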

Git

Git has a huge command set; I only use a handful of commands. You can find your own style of working with Git, but when I develop on my own, I feel safest just keeping a straight line of branches that I name by the dates I worked on them. Like this: assume it's Oct 12, 2011, and I have a branch called Sep25_2011, which is my currently checked out (i.e., active) branch. I start the day doing this:

git branch Oct12_2011
git checkout Oct12_2011
Now I change the code. When I'm done for the day, I do:
git push origin Oct12_2011:Oct12_2011
This will create a new branch in the remote repo, with the same name as the local branch I created in the morning. It's an extremely conservative use of Git, but it works for me. Feel free to be more adventurous, creating parallel branches, and merging them.
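
Spelled out as a full day's cycle, including the commit that the push above presupposes, the sketch looks like this (the branch names are just the example dates from above):

git checkout -b Oct12_2011                    # create the day's branch and switch to it in one step
# ... hack on the code during the day ...
git add -A                                    # stage everything that changed
git commit -m "Work of Oct 12, 2011"          # commit locally
git push origin Oct12_2011:Oct12_2011         # create/refresh the same-named branch in the remote repo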

After some research I decided on diffuse as my file diff viewing and merging tool. To make Git use diffuse for the 'git mergetool' command (after you have installed diffuse):
git config --global merge.tool diffuse
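
If you also want diffuse for plain diff viewing via the 'git difftool' command, the companion setting is:

git config --global diff.tool diffuse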

Git commands beyond the basics

The following commands are mostly from the Web. I did try them, but there could be typos. Try anything new with Git on a throwaway example repository first. Git can bite.

Github is our Code Repository

We'll use two kinds of facilities on Github: several code repositories, and one repository for intra-project info. Feel free to add pages and links to this index.html file. Current repos on Github are:

Machine Room Info

Structuring Interaction Between Sheet Engine and Machine Components

In the Github repo PigIRAnt, in directory PigScripts/CommandLineUtils, you'll find how I envision the machine room facilities working. Each processing module is made up of two files, plus associated Java user-defined functions (UDFs, in src): one Pig script that does the processing, and one Bash shell script that serves as a console command invoking the Pig script. Each shell script provides usage info when invoked with -h, --help, or no parameters.
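
So a first encounter with one of these modules is just asking its console script for usage info (the script name below is made up for illustration; use the actual scripts in PigScripts/CommandLineUtils):

./someProcessingModule.sh --help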

Each Pig script uses the WebBase loader or the WARC loader to pull in Web pages. The scripts' outputs are usually files in HDFS that can be consumed directly by the upper layers, or moved into SQLite.

The spreadsheet engine will invoke the shell scripts as OS calls from Java.

Our Cluster

We have a roughly 60-node cluster in the basement of the Gates building. The main machine among those is ilc0 (for info-lab-zero). You'll need an account on that machine; that's where you do your full tests. The machine offers both HDFS storage and a regular home directory section (/home/[userName]). You put your Pig scripts and corresponding shell scripts into the home directory section; results will show up in HDFS.
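
Once a job has run, you pull its output from HDFS into your home directory with the usual Hadoop file commands (the paths below are only placeholders; where your output lands depends on your Pig script):

hadoop fs -ls /user/yourUserName/                       # list what your jobs wrote into HDFS
hadoop fs -get /user/yourUserName/wordCounts.csv ~/     # copy one result file into your home directory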

Using Hadoop

Interacting with the HDFS file system from a shell command line on ilc0: here are some useful aliases to put into your .bashrc (or other shell) startup file (a few examples follow below).

Various URLs and files where you can monitor what Hadoop is doing:
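
For the .bashrc aliases mentioned above, a few common shortcuts (just examples; pick your own):

alias hfs='hadoop fs'          # e.g. 'hfs -ls /' lists the HDFS root
alias hls='hadoop fs -ls'
alias hcat='hadoop fs -cat'
alias hget='hadoop fs -get'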

Using Pig

Random Pig hints:
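
One kind of hint that belongs here, as an example (the script and parameter names are made up; the flags themselves are standard Pig):

pig -x local myScript.pig                       # run a script locally, without the cluster, for quick debugging
pig myScript.pig                                # run the same script on the cluster in MapReduce mode
pig -param crawlName=gov-2007-03 myScript.pig   # pass a value into a script that declares $crawlName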

Interacting With WebBase Via a Browser

To see which crawls are available, and which sites each crawl covered, you can interact with WebBase directly via the Web; that route is independent of Hadoop and the Pig loader.

Go to http://diglib.stanford.edu:8091/~testbed/doc2/WebBase/. Once there, find the paragraph on Wibbi, and click on the link there.

You'll find a page that lets you define a stream of pages from one crawl. On the first page you specify how many pages you want, and how you want them filtered. On the next page you'll specify which crawl you want.

On that crawl selection page you'll see the crawl names in the first column. That's the name the Pig WebBase loader needs to find the crawl.

When you hit the download button in one of the rows, your browser will ask you where you want the impending stream to be stored. The file you specify there will hold all the pages you download.

Sample Hadoop-Created Crawl Info Files

For reference I created two datasets as examples of what we will get out of the Hadoop processing step. One is a CSV wordcount file for (part of) the March 2007 government crawl. The second is a part-of-speech-tagged CSV file for the first 1000 pages of the June 2007 general crawl. You can find them at http://infolab.stanford.edu/~paepcke/shared-documents/Datasets/

These are good examples: the wordcount file has simple strings or numbers in its comma-separated columns, while the POS file has little two-tuples in its columns. HongXia's DB library will hide this difference.

Useful Software

Page Processing Utilities

Several classes serve both as modules within Hadoop jobs and as utilities within your Java applications. These classes are bundled in the Git repository PigIR.

The following utilities are currently available:

Access these facilities via the four classes in package edu.stanford.pigir.arcspread. They each include a main() method with an example.
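
For a quick first look you can run one of those main() methods straight from Maven, using the exec plugin (this assumes you have added the PigIR dependency described below, and that the class's main() runs without required arguments; its usage message will tell you if not):

mvn compile exec:java -Dexec.mainClass=edu.stanford.pigir.arcspread.POSTagger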

Accessing the Page Processing Utilities From Your Code

The easiest way to use this code is to splice the following into your pom.xml:
<repositories>
<repository>
<id>mono.stanford.edu</id>
<name>Stanford Infolab Maven Repository</name>
<url>http://mono.stanford.edu:8081/artifactory/ext-release-local</url>
</repository>
</repositories>
...
<dependencies>
<dependency>
<groupId>PigIR</groupId>
<artifactId>PigIR</artifactId>
<version>1.1</version>
<classifier>jar</classifier>
</dependency>
</dependencies>
...

This should resolve your access to these utilities both on the command line (mvn compile), and within Eclipse, if you import your code as a Maven project.
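
A quick way to check that the repository and dependency entries took hold (the PigIR artifact should show up in the output):

mvn dependency:tree | grep -i pigir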

Then, in your code, import what you need. For example, to use the part-of-speech tagger:
import edu.stanford.pigir.arcspread.POSTagger;

For the WebBase page extraction utility:
import edu.stanford.pigir.webbase.DistributorContact;
import edu.stanford.pigir.webbase.WbRecord;
import edu.stanford.pigir.webbase.wbpull.webStream.BufferedWebStreamIterator;

Distributing your Code

If your code uses the PigIR utilities, then your users will need access to PigIR as well. The easiest way to provide it is to splice this into your pom.xml dependencies section:
<dependencies>
<dependency>
<groupId>PigIR</groupId>
<artifactId>PigIR</artifactId>
<version>1.1</version>
</dependency>
</dependencies>
If you want to use the Stanford Part-Of-Speech (POS) tagger directly (no longer recommended; this route is deprecated in favor of the POSTagger utility above), add the following repository and dependency information to your pom.xml file. If you already have a repositories or dependencies entry in your pom.xml, splice the two entries below into those existing elements:
    <repositories>
    <repository>
    <id>mono.stanford.edu</id>
    <name>Stanford Infolab Maven Repository-releases</name>
    <url>http://mono.stanford.edu:8081/artifactory/ext-release-local</url>
    </repository>
    </repositories>

    <dependencies>
    <dependency>
    <groupId>stanford-postagger</groupId>
    <artifactId>stanford-postagger-with-model</artifactId>
    <version>2011-04-20</version>
    </dependency>
    </dependencies>

    This will automatically download the jar file and adjust your classpath so the entries within it are found.