A Programming Primer for Counting and Other Unconventional Tasks

The Mechanize Gem

Automate the annoying parts of web-scraping

This chapter is just a stub and covers the basics of Mechanize version 2.x. A full example of how I use Mechanize can be found in the project where I scrape California's Common Surgeries database. Check out Mechanize's official tutorial, too.

The Mechanize gem gives us a high-level interface for all the concepts we've covered in the web-scraping chapters. I didn't introduce it right away because it's more important to have a basic understanding of how websites accept and return data to the browser, and Mechanize keeps most of those details hidden. But it's pretty damn useful. And besides saving us a lot of time by handling those details, it can navigate certain complex websites that I've had trouble with using lower-level programs.

This chapter assumes you've read the chapter on using Nokogiri for HTML parsing. The mechanize gem depends on the nokogiri gem, so Nokogiri has to install successfully before Mechanize will.

Simplify the scraping of complex websites

We've seen how Nokogiri makes it easy to parse HTML content. But sometimes the challenge is navigating to the right page in the first place.

This is the case with many websites that require you to fill out a form. It's not enough to know the URL of a remote script and pass it parameters using the RestClient gem, as you would for a public API. The program that backs a website might perform a variety of checks, such as the existence and state of a browser cookie, before letting you submit a request. Your browser handles these kinds of checks invisibly, but the simple fetching scripts I've written so far do not.

This is where the Mechanize gem comes in. It leverages Nokogiri (or another parser of your choice) to parse a page for the relevant forms and buttons and provides a simplified interface for manipulating a webform:


require 'rubygems'
require 'mechanize'

# Create an agent, then fetch a page (the URL here is a placeholder)
agent = Mechanize.new
agent.get("http://somepagewithaform.com")
        

The actual webpage can be accessed using the Mechanize instance's page method, which technically returns an instance of Mechanize::Page. To parse it as you would through Nokogiri, you call the Mechanize::Page instance's parser method. And then you can use the css method that we're all fond of.

So, to find the first form of the retrieved page, you do:


form = agent.page.parser.css('form')[0]
            

However, the Mechanize gem gives us a handy shortcut, which returns a Mechanize::Form object rather than a raw Nokogiri node:


agent.page.forms[0]
            

To see a list of input fields:


agent.page.forms[0].fields
            

Let's say there are input fields with names of "first_name", "last_name", and "phone". This is how you set those field values using the Mechanize agent instance:


agent.page.forms[0]["first_name"] = "Dan"
agent.page.forms[0]["last_name"] = "Nguyen"
agent.page.forms[0]["phone"] = "5553526493"

                    
Form submission

This is how you submit the form:


agent.page.forms[0].submit
    

After the submit call finishes, agent.page contains the result of your submission. You don't need to declare a separate variable for each page, because agent.page always holds whatever page you're currently on.

Mechanize in action

This section is currently in the works. You can see how I use Mechanize to download pages from California's Common Surgeries database and from the Putnam County jail archive.

For now, I've included an example of how to upload a file to Flickr, copied from the Mechanize homepage. I will do a writeup of how it works:


require 'rubygems'
require 'mechanize'

abort "#{$0} login passwd filename" if (ARGV.size != 3)

a = Mechanize.new { |agent|
  # Flickr refreshes after login
  agent.follow_meta_refresh = true
}

a.get('http://flickr.com/') do |home_page|
  signin_page = a.click(home_page.link_with(:text => /Sign In/))

  my_page = signin_page.form_with(:name => 'login_form') do |form|
    form.login  = ARGV[0]
    form.passwd = ARGV[1]
  end.submit

  # Click the upload link
  upload_page = a.click(my_page.link_with(:text => /Upload/))

  # We want the basic upload page.
  upload_page = a.click(upload_page.link_with(:text => /basic Uploader/))

  # Upload the file
  upload_page.form_with(:method => 'POST') do |upload_form|
    upload_form.file_uploads.first.file_name = ARGV[2]
  end.submit
end