So I decided to screen scrape http://www.wizards.com/default.asp?x=dnd/lists/feats because I wanted the list of feats. To get to the table, and each row of feats, here is the XPath I used:
"/html/body/table[2]/tr[2]/td[3]/table/tr[3]/td/table[4]/tr/td/table/tr"
Of particular note, I was having trouble with the tbody in the XPath, so I removed those references. Let me tell you, without Firebug's copy XPath, I would've gone insane getting this out.
So I ask, why the heck does Wizards of the Coast use table-based layouts? They make my brain hurt.
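On the tbody trouble: a likely cause is that browsers add an implicit tbody to every table in the DOM, so a path copied from Firebug contains tbody steps that are not in the raw HTML a parser reads. Here is a minimal sketch of stripping them automatically. The helper name strip_tbody and the sample path are made up for illustration; this is not the exact path Firebug gave me.

```ruby
# Hypothetical helper: remove the tbody steps from a browser-copied XPath
# so the path matches raw HTML that has no explicit <tbody> tags.
def strip_tbody(xpath)
  # Matches "/tbody" or "/tbody[1]" when followed by "/" or the end of the string.
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

# Illustrative path only, shortened from the kind of thing Firebug produces.
firebug_path = "/html/body/table[2]/tbody/tr[2]/td[3]"
puts strip_tbody(firebug_path)
```

This only makes sense when the source HTML really has no tbody tags; if the page writes them out explicitly, the steps should stay.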
If you are interested, here is the code that I wrote. It doesn't have any underlying tests, but I ran the import without a failure, so for now, it's good enough for me.
require 'open-uri'
require 'hpricot'

class Feat < ActiveRecord::Base
  validates_presence_of :name, :description
  belongs_to :rule_source

  class << self
    # Run the import, first getting the RuleSources
    def import!
      path = File.dirname(__FILE__) + '/feats.html'
      # Save a local copy of the page, then parse that copy
      File.open(path, 'w') do |f|
        f.puts(open("http://www.wizards.com/default.asp?x=dnd/lists/feats").read)
      end
      doc = File.open(path) { |f| Hpricot(f) }
      RuleSource.parse_rule_source(doc)
      parse_feats(doc)
    end

    private

    def parse_feats(doc)
      feats = (doc/"/html/body/table[2]/tr[2]/td[3]/table/tr[3]/td/table[4]/tr/td/table/tr")
      feats.pop # Get rid of the results count
      feats.each_with_index do |feat, index|
        parse_feat(feat) if index > 1 # Skip the first two rows
      end
    end

    def parse_feat(feat)
      # ... Do the actual parsing of the row ...
    end
  end
end
class RuleSource < ActiveRecord::Base
  validates_presence_of :code, :name

  class << self
    def parse_rule_source(doc, publisher = nil)
      (doc/"div.keyboxscroll/table/tr").each do |source|
        if source_code = (source/"td[1]")
          source_code = source_code.inner_html.strip
        end
        if source_name = (source/"td[2]/i")
          source_name = source_name.inner_html.strip
        end
        find_or_create_by_code_and_name( source_code, source_name )
      end
    end
  end
end
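As an aside on parse_feats: popping the last row and then skipping indexes 0 and 1 amounts to taking a slice of the rows. Here is a rough sketch using a plain array in place of the Hpricot elements. The method name feat_rows and the sample strings are invented for illustration, and I am assuming the first two rows are non-feat rows, which is all the index check tells us.

```ruby
# Hypothetical helper: keep everything except the first two rows and the last row.
# Unlike pop, this does not modify the array that is passed in.
def feat_rows(rows)
  rows[2...-1] || [] # the slice is nil when there are fewer than two rows
end

# Stand-in data, not real page content.
rows = ["row one", "row two", "Alertness", "Cleave", "Dodge", "results count"]
feat_rows(rows).each { |row| puts row }
```

The slice reads a little more directly than pop plus an index check, but both select the same rows.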
1 comment:
Ahhh, the beauty of Rails. I have read much but not tried it personally. I'm happy just using PHP4 to support my hobby. Seriously though, it's not automated, but copy/paste to Notepad often does a reasonable job of cleaning up table-based text. I was wondering why you don't use the hypertext d20 SRD?
It's invaluable to me: http://www.d20srd.org/