Here is a more detailed breakdown: I'd like to automate gathering contact information from businesses, initially by building lists of businesses in various countries/regions from yellow pages and other directories. Once the lists are built, the businesses' websites need to be checked for their contact pages/contact info. Since it needs to work for different languages, the regexes need to be easily configurable so the language words/phrases can be changed. The methods will differ from site to site, which is why a customizable crawler is best.

The first steps can all be the same: follow links inside the directory site, looking for pages with keywords like category, phone, and address, and check whether the words occur multiple times on a page or whether groups of them occur together. An example is YellowPages.com: the directory has the names of the businesses and their phone numbers, but the entries are not easily recognizable to a bot that wasn't written specifically for that site. For a page like that, the crawler would try the normal contact keywords and, once that has failed, it could look at the number of phone numbers occurring on the page and cut the page into phone-number-to-phone-number chunks (stripping extra HTML but leaving various tokens). Later that bulk data can be viewed and separated through some custom methods. The accuracy won't be as good as it would be if the crawler were written for a specific site, but the ability to tweak the regexes will allow it to get more precise. Even items like finding which link goes to the next page in a directory can be customizable. Most directories use similar labels ('email', 'address', 'phone'), so they'll be easier to lock onto without too much customization.

Later it'll need an automated mailer to email each site's contact addresses and read the responses to see whether they're the approved email for normal contact. The other crawler is to crawl forums and build a knowledge base of information on various predetermined subjects, favoring quality of information over quantity.
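To make the phone-number chunking fallback more concrete, here is a rough Python sketch of how it might work. Everything in it is an assumption for illustration: the loose North American phone pattern, the choice of `|` as the token left behind where tags were stripped, the `min_phones` threshold, and the decision to end each chunk at a phone number (which presumes the name comes before the number in the listing). All of these would need to be configurable per directory.

```python
import re
from html import unescape

# Hypothetical, deliberately loose phone pattern (North American style only).
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(html):
    """Replace tags with a '|' token so field boundaries survive stripping."""
    text = unescape(TAG_RE.sub(" | ", html))
    text = re.sub(r"\s+", " ", text)          # collapse whitespace
    text = re.sub(r"(?:\|\s*)+", "| ", text)  # collapse runs of '|' tokens
    return text.strip(" |")


def chunk_by_phone(html, min_phones=3):
    """Cut a page into chunks that each end at a phone number.

    Returns an empty list when the page has fewer than min_phones numbers,
    i.e. when it does not look like a listing page.
    """
    text = strip_html(html)
    matches = list(PHONE_RE.finditer(text))
    if len(matches) < min_phones:
        return []
    chunks = []
    start = 0
    for match in matches:
        chunks.append(text[start:match.end()].strip(" |"))
        start = match.end()
    return chunks


if __name__ == "__main__":
    page = (
        "<div><b>Acme Plumbing</b><span>(555) 123-4567</span></div>"
        "<div><b>Best Bank</b><span>555-987-6543</span></div>"
        "<div><b>City Realty</b><span>555.222.3333</span></div>"
    )
    for chunk in chunk_by_phone(page):
        print(chunk)
```

Each chunk keeps the `|` tokens, so the bulk data can still be split into fields later by whatever custom method fits the directory.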
Here is the db code:

CREATE TABLE directory (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  description VARCHAR(128) NOT NULL DEFAULT '',
  url VARCHAR(512) NOT NULL DEFAULT '' COMMENT 'directory starting url',
  country CHAR(3) NOT NULL DEFAULT '' COMMENT 'country code',
  status VARCHAR(16) NOT NULL DEFAULT '' COMMENT 'uncrawled, started, stopped, etc.',
  dateCrawled TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'date it was last crawled',
  lastCrawledUrl VARCHAR(512) NOT NULL DEFAULT '' COMMENT 'last url crawled if crawling stopped',
  PRIMARY KEY (id)
) COMMENT='Directories' DEFAULT CHARSET=utf8 ENGINE=INNODB;

CREATE TABLE urlLog (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  directory INT UNSIGNED NOT NULL DEFAULT 0 COMMENT 'the directory this url came from',
  url VARCHAR(512) NOT NULL DEFAULT '' COMMENT 'url',
  dateCrawled TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'date it was last crawled',
  title VARCHAR(64) NOT NULL DEFAULT '' COMMENT 'page title',
  pageNotFound BOOLEAN NOT NULL DEFAULT 0 COMMENT 'if page was not found on later check',
  comment VARCHAR(32) NOT NULL DEFAULT '' COMMENT 'comments on page',
  PRIMARY KEY (id),
  KEY (directory),
  FOREIGN KEY (directory) REFERENCES directory(id) ON UPDATE CASCADE ON DELETE RESTRICT
) COMMENT='List of scanned urls in directory' DEFAULT CHARSET=utf8 ENGINE=INNODB;

CREATE TABLE listing (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  urlLog INT UNSIGNED NOT NULL DEFAULT 0 COMMENT 'the id of the urlLog item this listing was found in',
  dateUpdated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  category VARCHAR(64) NOT NULL DEFAULT '' COMMENT 'examples: insurance, banks, real estate',
  `name` VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'business name',
  email VARCHAR(1024) NOT NULL DEFAULT '' COMMENT 'email addresses',
  website VARCHAR(512) NOT NULL DEFAULT '' COMMENT 'web addresses',
  phone VARCHAR(128) NOT NULL DEFAULT '' COMMENT 'phone numbers, including extensions if existing',
  address VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'business address',
  postalcode CHAR(10) NOT NULL DEFAULT '' COMMENT 'postal or zip code',
  city VARCHAR(100) NOT NULL DEFAULT '' COMMENT 'city or town',
  state VARCHAR(32) NOT NULL DEFAULT '' COMMENT 'state or province',
  country CHAR(35) NOT NULL DEFAULT '' COMMENT 'country name or country code',
  logo VARCHAR(160) NOT NULL DEFAULT '' COMMENT 'file name of downloaded logo image',
  rating VARCHAR(20) NOT NULL DEFAULT '' COMMENT 'examples: *****, 9/10, 4 out of 5 stars',
  customerEmail VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'customer contact email address',
  customerName VARCHAR(96) NOT NULL DEFAULT '' COMMENT 'customer contact name',
  customerPhone VARCHAR(96) NOT NULL DEFAULT '' COMMENT 'customer contact number',
  customerUrl VARCHAR(512) NOT NULL DEFAULT '' COMMENT 'customer contact url',
  customerHasForm BOOLEAN NULL DEFAULT NULL COMMENT 'whether customer contact page uses a form instead',
  PRIMARY KEY (id),
  KEY (urlLog),
  FOREIGN KEY (urlLog) REFERENCES urlLog(id) ON UPDATE CASCADE ON DELETE RESTRICT
) COMMENT='Directory listings' DEFAULT CHARSET=utf8 ENGINE=INNODB;

CREATE TABLE regex (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  directory VARCHAR(128) NOT NULL DEFAULT '' COMMENT 'comma separated list of directory ids this regex is used in',
  description VARCHAR(128) NOT NULL DEFAULT '' COMMENT 'which item(s) this regex retrieves',
  regex VARCHAR(1024) NOT NULL DEFAULT '',
  PRIMARY KEY (id)
) COMMENT='Regular Expressions' DEFAULT CHARSET=utf8 ENGINE=INNODB;

CREATE TABLE language (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'index',
  language CHAR(3) NOT NULL COMMENT 'language code, eng, en, pa, de...',
  base VARCHAR(128) NOT NULL COMMENT 'the english word/phrase',
  alternate VARCHAR(128) NOT NULL COMMENT 'word/phrase in other language',
  PRIMARY KEY (id)
) COMMENT='Language word/phrase map for regular expressions' DEFAULT CHARSET=utf8 ENGINE=INNODB;
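
As a rough idea of how the language table could feed the configurable regexes, here is a small Python sketch. It assumes the rows have already been read from the language table as (language, base, alternate) tuples; the function name `build_keyword_regex`, the fallback to the English word when no translation exists, and the sample German rows are all made up for illustration.

```python
import re


def build_keyword_regex(base_words, language_rows, lang):
    """Build one case-insensitive regex matching the given keywords in `lang`.

    language_rows: iterable of (language, base, alternate) tuples, shaped
    like rows of the language table. Base words with no translation for
    `lang` fall back to the English word itself (an assumption).
    """
    alternates = {}
    for row_lang, base, alternate in language_rows:
        if row_lang == lang:
            alternates.setdefault(base.lower(), []).append(alternate)

    words = []
    for base in base_words:
        words.extend(alternates.get(base.lower(), [base]))

    # Longest first so longer phrases win over their own prefixes.
    words.sort(key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:%s)\b" % pattern, re.IGNORECASE)


if __name__ == "__main__":
    # Hypothetical sample rows, not real data.
    rows = [
        ("de", "phone", "Telefon"),
        ("de", "address", "Adresse"),
        ("de", "email", "E-Mail"),
    ]
    rx = build_keyword_regex(["phone", "address", "email"], rows, "de")
    page_text = "Telefon: 030 1234, Adresse: Hauptstr. 1, telefon 2"
    hits = rx.findall(page_text)
    # Several keyword hits on one page suggest it is a listing page.
    print(len(hits), hits)
```

Counting the hits per page gives a simple way to check whether the keywords occur multiple times, as described in the breakdown.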