Wednesday, February 07, 2007

Hpricot, screen_scrape, and horrible pages

I'm an avid Dungeons and Dragons player and have decided I'm going to make a Rails application that will help consolidate the rules for personal use.

So I decided to screen scrape http://www.wizards.com/default.asp?x=dnd/lists/feats to get the list of feats. To reach the table and each row of feats, here is the XPath I used:

"/html/body/table[2]/tr[2]/td[3]/table/tr[3]/td/table[4]/tr/td/table/tr"

Of particular note, I was having trouble with the tbody elements in the XPath, so I removed those references. Let me tell you, without Firebug's Copy XPath, I would have gone insane getting this out.
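The tbody trouble most likely comes from the browser: Firebug copies the path from the live DOM, where browsers insert tbody elements into tables even when the raw HTML that Hpricot parses has none. Here is a minimal sketch of stripping those steps from a copied path. The helper name strip_tbody is hypothetical (it is not part of Hpricot), and the sample path is only an illustration.

```ruby
# Hypothetical helper (not part of Hpricot): removes the tbody steps
# that a browser-copied XPath contains but the raw HTML may lack.
def strip_tbody(xpath)
  # Matches "/tbody" or "/tbody[n]" when followed by "/" or end of string.
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

# Illustrative path in the style Firebug would copy.
copied = "/html/body/table[2]/tbody/tr[2]/td[3]/table/tbody/tr[3]"
puts strip_tbody(copied)
```

The cleaned path can then be handed to Hpricot's search the same way as the hand-edited one.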

So I ask: why the heck does Wizards of the Coast use table-based layouts? They make my brain hurt.

If you are interested, here is the code that I wrote. It doesn't have any underlying tests, but the import ran without failing, so for now it's good enough for me.


require 'open-uri' # lets open() fetch a URL
require 'hpricot'

class Feat < ActiveRecord::Base
  validates_presence_of :name, :description
  belongs_to :rule_source

  class << self
    # Run the import, first getting the RuleSources
    def import!
      # Save a local copy of the page, then parse that copy
      File.open(File.dirname(__FILE__) + '/feats.html', 'w') do |f|
        f.puts(open("http://www.wizards.com/default.asp?x=dnd/lists/feats").read)
      end
      doc = File.open(File.dirname(__FILE__) + '/feats.html') { |f| Hpricot(f) }

      RuleSource.parse_rule_source(doc)

      parse_feats(doc)
    end

    private

    def parse_feats(doc)
      feats = (doc/"/html/body/table[2]/tr[2]/td[3]/table/tr[3]/td/table[4]/tr/td/table/tr")
      feats.pop # Get rid of the results count

      # Skip the first two rows; the rest are feats
      feats.each_with_index do |feat, index|
        parse_feat(feat) if index > 1
      end
    end

    def parse_feat(feat)
      # ... Do the actual parsing of the row ...
    end
  end
end

class RuleSource < ActiveRecord::Base
  validates_presence_of :code, :name

  class << self
    def parse_rule_source(doc, publisher = nil)
      # Each row of the key box holds one rule source
      (doc/"div.keyboxscroll/table/tr").each do |source|
        # First cell: the source code
        if source_code = (source/"td[1]")
          source_code = source_code.inner_html.strip
        end
        # Second cell: the source name, in italics
        if source_name = (source/"td[2]/i")
          source_name = source_name.inner_html.strip
        end

        find_or_create_by_code_and_name(source_code, source_name)
      end
    end
  end
end
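
The pop plus the index > 1 check in parse_feats amounts to dropping the first two rows and the last one. Here is a small sketch of the same selection done with a single slice, using a plain array of strings in place of the Hpricot rows. The helper name data_rows and the row labels are hypothetical, and unlike pop, the slice does not modify the original list.

```ruby
# Hypothetical helper: returns the rows without the first two (skipped by
# `index > 1` in parse_feats) and without the last (the results count).
def data_rows(rows)
  # The slice is nil when there are fewer than two rows, so fall back to [].
  rows[2...-1] || []
end

# Placeholder strings standing in for table rows; the labels are assumptions.
rows = ["title", "column headers", "feat A", "feat B", "results count"]
p data_rows(rows)
```

Whether this reads better than the explicit loop is a matter of taste. It should behave the same on the Hpricot result, assuming that result slices like an ordinary array.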

1 comment:

-Peter Blind said...

Ahhh, the beauty of Rails. I have read much but not tried it personally. I'm happy just using PHP4 to support my hobby. Seriously though, it's not automated, but copy/paste to Notepad often does a reasonable job of cleaning up table-based text. I was wondering why you don't use the Hypertext d20 SRD?
It's invaluable to me: http://www.d20srd.org/