A Programming Primer for Counting and Other Unconventional Tasks

CA Surgeries: Mechanize

Use Mechanize to navigate and scrape the Common Surgeries Database

One of the reasons why I chose the Common Surgeries data is because it's easy – though not straightforward – to scrape. In fact, it stumped me for an hour because I tried to do it without using the Mechanize gem. But once I decided, "Screw it, I'm just going to use the Mechanize gem," it was about 5 minutes to write the scraping code.

It doesn't appear that there's a link to download all the Common Surgeries data at once. But because it's public data, a public records request would probably do the trick. But why wait 3-5 days when you could get it all during your lunch break?

This decision depends on the website and the reticence of whoever's in charge of handling data requests. Sometimes a public records request by e-mail the fastest and most thorough way to get data. Sometimes it's just not possible to out-code a public facing website.

Experience will give you insight on what's the better return on investment. Anyway, this is meant to be practice.

Trying out the Common Surgeries website

The first thing you see when visiting www.oshpd.ca.gov/commonsurgery/ is five drop-down select menus for: Discharge Year, Procedure Category, Procedure, County and City

Clicking on each drop-down menu reveals the left-to-right process needed to actually get the data: when you select an option, the next drop-down menu populates with options. Selecting 2008 for the first drop-down populates the Procedure Category drop-down. Selecting "Obsterical Procedures" as a category populates the Procedure drop-down with just two options: "Assisted Vaginal Delivery (all)" and "C-Section Delivery".

After selecting a Procedure and clicking Go the website finally displays some data: a table of all the hospitals who have reported performing the selected procedure for the selected year.

Clicking on Go before selecting a procedure brings up nothing. Specifying a County and/or City only limits the results shown.

As a web developer, I can understand why this site is structured as is. It prevents the loading of unnecessary options in the drop-downs (if the user is interested in Sacramento County, no use in populating the City field with "Los Angeles") and the table design is simplified, as there's no need for year, category, and procedure columns since those will have already been specified.

But this website makes it very inconvenient to do simple comparisons beyond looking at individual procedures among hospitals within a given year. For example, selecting a different Year basically resets the selection process.

To find all records for a given year and procedure, it requires seven clicks (two per each of the three drop-downs, plus the final Go). Nevermind the annoying part of highlighting the table to copy and paste it into Excel (and then whatever you have to do to remove the formatting).

So, to collect all of the three years of data for 37 procedures, that's 777 clicks – plus about 5 to 10 seconds to wait for the page to load – not counting the most time consuming step: the accurate copying-and-pasting of the data tables.

If you're the type of professional who can afford an intern to do this, bravo. But the problem isn't time, it's accuracy. How can you tell whether one table of data didn't result from a misplaced click? The website's data table itself doesn't include this information. So, when you account for the time to do the tedious double-checking, a simple data-fetching task now becomes a considerable source of frustration.

The only sane way to get all of this data is through a public records request. Or to scrape the site. Here's how to do the latter.

Inspecting the Common Surgeries website

Get out your favorite web inspector – I typically use Google Chrome or Firefox w/ Firebug – for this next step. Take a look-see of my chapter on web inspectors, if that last sentence made no sense to you.

In order to dynamically populate the drop-down fields with each user select-action, the surgeries site uses Javascript to detect the current state of the drop-down menus. This is sent to the server-side script, which then determines what drop-down options (i.e. the "C-Section" procedure option if the "Obsterical" category is the active selection) or hospital records to update the webpage with.

(This updating of specific page elements without reloading the entire page is AJAX in action, if you're unfamiliar with what that actually means.)

Open up your web inspector to its Network panel. Then reload the Common Surgeries page.

When the entire page reloads, the web inspector will show all the requests needed to load each of the page's assets, including images, external javascript files, and the page itself. Now select one of the Year options and you should see a new asset being loaded:

That POST request for a file called "default.aspx" is how the web server and webpage communicates data requests/transfers. If you examine the headers for that request, you'll see a lot of technical data. OK, what to make of this? Well, we just made a selection for Year, so look for anything that might have something to do with that:

Post details

So it appears that the following parameters are affected or altered by our drop-down select-action:

  • __EVENTTARGET
  • ctl00%24ContentPlaceHolder1%24ddlDischargeYear

Select one of the categories. In the web inspector's Network Panel (which we covered in the traffic inspection chapter), you should see a new request for default.aspx. Inspect the headers and you should see this:

The pattern seems clear: every time the user selects an item from a drop-down, the webpage submits the default.aspx script with parameters that correspond to that item. View the source of the Common Surgery page and look for the <form> tags. I've highlighted the relevant parts below:

If you aren't familiar with how HTML forms typically work, the various input fields have name and value attributes. The former contains the name of the parameter sent to the server to process; the latter contains the value of that parameter, which will be whatever you input or selected.

The drop-down corresponding to Year has a name of ctl00$ContentPlaceHolder1$ddlDischargeYear.

So, by selecting a year, we initiated a Javascript call that submits default.aspx containing, among other data, the parameter/value pair: ctl00$ContentPlaceHolder1$ddlDischargeYear=2008

RestClient or Mechanize?

It should be possible to simulate a POST request for default.aspx with RestClient, by setting all the parameters (e.g. ctl00$ContentPlaceHolder1$ddlDischargeYear) to the desired values. This doesn't quite work.

We don't know exactly how the Common Surgeries server-side script is configured. But it is apparently looking for more validation than just setting the form parameters. It could be looking at the user's cookie, for example. Or, it could be looking at the state of the form's hidden input fields:

It's probably not too hard to play around and figure what variables we'd have to set before the backend script accepts our programmatic call and responds with results as it would to a web browser. But in the name of brevity, it's much easier just to use the Mechanize gem.

I've started writing a tutorial on Mechanize but it's not complete enough to be of much additional help right now.

The Code

This section of code involves these steps:

  1. Submitting the forms with Mechanize – Loop through every option of every select-drop-down and use Mechanize to submit the form requests until the website returns actual data.
  2. Downloading the HTML of each request - Just download the webpage results as raw HTML. We'll parse them in the next part of this project.

The initial strategy will be to use Mechanize to select a year, then a category, and then a procedure. According to our previous exploration, the Common Surgeries should return a list of all hospitals that performed the given procedure for that year. We save this list to our hard drive as HTML.

The Mechanize agent then moves on to the next procedure and when it reaches the end, it goes back to the list of category options, picks a new category, and then repeats its iteration through the available procedures.

After all the categories and procedures have been explored, the Mechanize agent returns to the years drop-down, goes to the next year, and the whole process begins again. When the final year has been iterated through, we'll have saved all the data we need.

Submit with Mechanize

Let's start off by setting up the environment and the constants, which include the url of the Common Surgeries site and the name of the relevant drop-down fields (which you can find with your web inspector):


require 'rubygems'
require 'mechanize'

HOME_URL = "http://www.oshpd.ca.gov/commonsurgery/"
SELECT_FIELD_NAMES = {
  'year'=>'ctl00$ContentPlaceHolder1$ddlDischargeYear',
  'category'=>'ctl00$ContentPlaceHolder1$ddlProcedureCategory',
  'procedure'=>'ctl00$ContentPlaceHolder1$ddlProcedure'}

    

I've made the drop-down field names a hash for better readability; SELECT_FIELD_NAMES['year'] is easier to remember as a reference to the Year field than SELECT_FIELD_NAMES[0]

Initialize Mechanize

Now create a Mechanize agent and use its get method to access HOME_URL:


mech = Mechanize.new
mech.get(HOME_URL)
    

That get method retrieves the page at the given url. The page can be accessed through the page method. If you want the page as a Nokogiri-wrapped object so that you can parse it with the Nokogiri methods, you simply call the parser method on the Mechanize::Page instance that page returns.

Confusing? It's easier to show than explain:


puts mech.page.class    
#=> Mechanize::Page 

puts mech.page.parser.class 
#=> Nokogiri::HTML::Document

puts mech.page.parser.css("div.title").text 
#=> Common Surgeries and Charges Comparison        
    
Getting the HTML form from a page

The Mechanize::Page instance has helpful methods to pick out the relevant form and fields. The Mechanize::Page forms method returns an array of all form elements on the page. This is equivalent to doing css('form') with a Nokogiri-wrapped page object.

As it turns out, there's only one form on the webpage, so mech.page.forms[0] gives us the appropriate form. But if there were more than one form, we could identify the one we want by its action attribute. Typically, a form's action attribute is the script it submits to the server.

As we learned through web inspection, the name of the script is default.aspx. We can use the following Mechanize method to target the form with an action of that name:


form = mech.page.form_with(:action=>/default.aspx/)        
    

The form_with method accepts a regular expression as the argument. This saves us the time of having to check if the form's action is just default.aspx or the full absolute path, http://www.oshpd.ca.gov/commonsurgery/default.aspx

Getting the fields of a form

The form variable holds an instance of Mechanize::Form, another Mechanize construct with some helpful methods to handle HTML forms. Here's how to select fields, select a field by name, and retrieve a list of options:


puts form.fields.map{|f| f.name}.join("\n")
#=> __VIEWSTATE
#=> __EVENTVALIDATION
#=> ctl00$ContentPlaceHolder1$ddlDischargeYear
#=> ctl00$ContentPlaceHolder1$ddlProcedureCategory
#=> ctl00$ContentPlaceHolder1$ddlProcedure
#=> ctl00$ContentPlaceHolder1$ddlCounty
#=> ctl00$ContentPlaceHolder1$ddlCity
    
year_field = form.field_with(:name=>SELECT_FIELD_NAMES['year']) # i.e. ctl00$ContentPlaceHolder1$ddlDischargeYear
puts year_field.options.map{|o| o.value}.join(", ")
#=> [Select Year], 2009, 2008, 2007    
    
Setting a field value

I want to re-emphasize how much of Mechanize functionality is convenience methods. That is, you can get the same results without it, just with more/different coding. After all, we're dealing with HTML, which can be parsed by the Nokogiri library alone. For example, this is how you would pull the list of options for the Year field with just the Nokogiri library:


optionvals = nokogiri_page.css("select[@name='#{SELECT_FIELD_NAMES['procedure']}'] option").map{|o| o.value}
    

It's a matter of preference. I'm going to stick to using Mechanize's library for consistency's sake.

Setting a field value

So now that we know how to get to a drop-down field and read its possible options, how do we set the value we want? Mechanize makes this really easy:


form = mech.page.form_with(:action=>/default.aspx/)        
form[SELECT_FIELD_NAMES['year']] = 2008        
    

That's all there's to it. The Mechanize::Form instance lets you access its fields through hash-like notation and allows you to both read and set their values.

Submitting a form

After setting the form to the values we want, we use Mechanize's various submit methods to submit our request. It will be just like clicking the "Go" button in our browser.

On the Common Surgeries website, there's a button labeled "Go". This is how you "click" it with Mechanize:

form.submit(form.button_with(:value=>'Go'))

With this method, the form is submitted and mech will grab the HTML page that is sent as a result, as if you had made a call to the get method. The HTML page will be stored in mech.page.

If you inspect the contents of mech.page, you'll see that it reflects the state of having the Year field set.

It's hard to see the difference between the page that we submitted and what we got back (it depends how good you are at reading raw HTML). The Mechanize instance (mech) has a nice history method which returns an array of the pages it has so-far visited.

At initialization, history returns an empty array. After the first get call:


mech.get(HOME_URL)
puts mech.history.map{|h| h.uri}.join(', ')    #=> http://www.oshpd.ca.gov/commonsurgery/
        
    

After we submit the form, we see that history reflects a visit to default.aspx:


form.submit(form.button_with(:value=>'Go'))
puts mech.history.map{|h| h.uri}.join(', ')     
#=> http://www.oshpd.ca.gov/commonsurgery/, http://www.oshpd.ca.gov/commonsurgery/default.aspx

        

With the Year field set, we're able to move on to the next step, which is to set the Category field, and so on.

Mechanizing: All together now

Here are all of the steps from creating the Mechanize instance to submitting the form. And remember, we don't want to submit the form just once, but submit a form for each option of each drop-down menu:


mech = Mechanize.new
mech.get(HOME_URL)            
form = mech.page.form_with(:action=>/default.aspx/)        
year_field = form.field_with(:name=>SELECT_FIELD_NAMES['year'])
year_field.options[1..-1].each do |year_option|
    form[SELECT_FIELD_NAMES['year']] = year_option.value        
    form.submit(form.button_with(:value=>'Go'))
    
    
    # ... on to the next drop-down menu
end
# exit loop after all years have been visited
        

Condensing this down gets us:


mech = Mechanize.new
mech.get(HOME_URL)
form = mech.page.form_with(:action=>/default.aspx/)            
form.field_with(:name=>SELECT_FIELD_NAMES['year']).options[1..-1].each do |year_opt|
    form[SELECT_FIELD_NAMES['year']] = year_opt.value        
    form.submit(form.button_with(:value=>'Go'))
end

    

Note that I'm skipping the first option of the drop-down field with the reference to options[1..-1]. As you can see from looking at the webpage, the first option of each drop-down menu is "Select ...", not an actual value.

Mechanizing, with all the loops

OK, the above code iterates just through the first drop-down menu. To go through the next two drop-down menus, we basically repeat the same action inside a new loop and with a reference to the appropriate drop-down field name:


## beginning of loop for Year drop-down
form.field_with(:name=>SELECT_FIELD_NAMES['year']).options[1..-1].each do |yr_opt|
  form[SELECT_FIELD_NAMES['year']] = yr_opt.value
  form.submit(form.button_with(:value=>'Go'))
  puts "Year #{yr_opt.value}: #{mech.page.parser.css('tr').length}"


  ## beginning of loop for Category drop-down
  form = mech.page.form_with(:action=>/default.aspx/) 
  form.field_with(:name=>SELECT_FIELD_NAMES['category']).options[1..-1].each do |cat_opt|
    form[SELECT_FIELD_NAMES['category']] = cat_opt.value
    form.submit(form.button_with(:value=>'Go'))
    puts "\tCategory #{cat_opt.value}: #{mech.page.parser.css('tr').length}"

    ## beginning of loop for Procedure drop-down
    form = mech.page.form_with(:action=>/default.aspx/) 
    form.field_with(:name=>SELECT_FIELD_NAMES['procedure']).options[1..-1].each do |proc_opt|
        form[SELECT_FIELD_NAMES['procedure']] = proc_opt.value
        form.submit(form.button_with(:value=>'Go'))
        puts "\t\tProcedure #{proc_opt.value}: #{mech.page.parser.css('tr').length}"
    end
  end
end
        
    

I've thrown in a puts statement after each form.submit action that indicates the current drop-down menu being manipulated, its current value, and how many table rows are in mech.page. The output from the above script is:

Year 2009: 8
   Category Digestive Procedures: 8
      Procedure Colectomy Large Intestine (laparoscopic): 254
      Procedure Colectomy Large Intestine (open): 318
      Procedure Gallbladder Removal (laparoscopic): 318
      Procedure Gallbladder Removal (open): 264
      Procedure Gastric Bypass (laparoscopic): 95
      Procedure Gastric Bypass (open): 113
      Procedure Inguinal Hernia Repairs (laparoscopic): 100
      Procedure Inguinal Hernia Repairs (open): 249
      Procedure Lap Band (insertion of adjustable gastric band and port): 87
   Category Female System Procedures: 8
      Procedure Hysterectomy - Abdominal (laparoscopic): 224
      Procedure Hysterectomy - Abdominal (open): 312
      Procedure Hysterectomy - Vaginal (laparoscopic): 245
      Procedure Hysterectomy - Vaginal (open): 290
      Procedure Mastectomy - (bilateral): 193
      Procedure Mastectomy - (unilateral): 284
   Category Heart and Circulatory Procedures: 8
      Procedure CABG (Coronary Artery Bypass Graft): 126
      Procedure Endarterectomy (Removal of plaque from blood vessels - Head and Neck): 229
      Procedure Heart Valve Replacement (Aortic or Mitral): 132
      
       ...

As expected, no data gets returned when only Year and Category are set (the default webpage always has 8 table rows, apparently). Once Procedure has been set, then we start seeing some results.

Add in County

If you make a selection in the County drop-down field, you end up limiting the results of the larger query to a single county. However, a hospital's county may be useful when trying to do compare hospitals of a general geographical region that share a county jurisdiction's resources. Since county isn't a column in the results set, and it's not straightforward to figure out a hospital's county, the County drop-down gives us a way to know which hospitals belong to which county.

Adding County to the mix would mean traversing the drop-down once for each of the 53 counties. There are 37 different procedures and 3 different years. Thus, to get all of the data broken down by county would mean hitting the website as many as (3 * 37 * 53 =) 5,883 times.

Well, good thing we're not doing this by hand. For the script we've written so far, all it takes is adding one more inner loop and iterating through the County drop-down options.

Downloading the HTML of each request

This step is simple: Just write the contents of mech.page into a local file.

We're only collecting the HTML, not parsing it. We're saving the entire pages to local storage so that we can parse through them in the next step. Since the result datatables don't include the values for the selected drop-down menus, we should decide a standardized file format that indicates what year,category,procedure, and county values were used to get each dataset.

I simply went with listing the values in the filename with double hypens as a delimiter:

YEAR--CATEGORY--PROCEDURE--COUNTY.html

When we work with the files later, we can retrieve the drop-down values with something like:


year,category,procedure,county = filename.split('--')
      

The Fetching Code: All Together

Here's the final code to retrieve the data from the Common Surgeries website and should net you 4,400+ html files. It's not particularly robust; if the website chokes and fails to respond, the script will probably crash. You're free to write your own exception-handling code to deal with this.

require 'rubygems'
require 'mechanize'

HOME_URL = "http://www.oshpd.ca.gov/commonsurgery/"

SELECT_FIELD_NAMES = {
  'year'=>'ctl00$ContentPlaceHolder1$ddlDischargeYear',
  'category'=>'ctl00$ContentPlaceHolder1$ddlProcedureCategory',
  'procedure'=>'ctl00$ContentPlaceHolder1$ddlProcedure',
  'county'=>'ctl00$ContentPlaceHolder1$ddlCounty'  
}


mech = Mechanize.new
mech.get(HOME_URL)
form = mech.page.form_with(:action=>/default.aspx/) 

## beginning of loop for Year dropdown
form.field_with(:name=>SELECT_FIELD_NAMES['year']).options[1..-1].each do |yr_opt|
  form[SELECT_FIELD_NAMES['year']] = yr_opt.value
  form.submit(form.button_with(:value=>'Go'))
  puts "Year #{yr_opt.value}: #{mech.page.parser.css('tr').length}"

  
  ## beginning of loop for Category dropdown
  form = mech.page.form_with(:action=>/default.aspx/) 
  form.field_with(:name=>SELECT_FIELD_NAMES['category']).options[1..-1].each do |cat_opt|
    form[SELECT_FIELD_NAMES['category']] = cat_opt.value
    form.submit(form.button_with(:value=>'Go'))
    puts "\tCategory #{cat_opt.value}: #{mech.page.parser.css('tr').length}"

    ## beginning of loop for Procedure dropdown
    form = mech.page.form_with(:action=>/default.aspx/) 
    form.field_with(:name=>SELECT_FIELD_NAMES['procedure']).options[1..-1].each do |proc_opt|
        form[SELECT_FIELD_NAMES['procedure']] = proc_opt.value
        form.submit(form.button_with(:value=>'Go'))
        puts "\t\tProcedure #{proc_opt.value}: #{mech.page.parser.css('tr').length}"
        
        ## beginning of loop for County dropdown
        form = mech.page.form_with(:action=>/default.aspx/) 
        form.field_with(:name=>SELECT_FIELD_NAMES['county']).options[1..-1].each do |county_opt|
            form[SELECT_FIELD_NAMES['county']] = county_opt.value
            form.submit(form.button_with(:value=>'Go'))
            puts "\t\tProcedure #{proc_opt.value}, #{county_opt.value}: #{mech.page.parser.css('tr').length}"
            fname = "data-hold/pages/#{yr_opt.value}--#{cat_opt.value}--#{proc_opt.value}--#{county_opt.value}.html"
            File.open(fname, 'w'){|f| f.puts mech.page.parser.to_html}
        end
        
    end
  end
end
        

I've decided to do a version that has exception-handling. It's not a sterling example because I would've written the script entirely differently if I knew it would be an issue. Errors from accessing a website are always an issue, of course, but the CA.gov website is pretty robust.

If you've never written a begin/rescue/else construct, this shows the basics. I've written a method that wraps up the exception handling for the Mechanize::Form.submit call, allowing it to retry 3 times before failing entirely.

Again, not a great example. I may revise it an a future edition of this chapter. You can also read the separate chapter on exception-handling

require 'rubygems'
require 'mechanize'
require 'fileutils' # for creating the directory
DIR = 'data-hold/pages'
FileUtils.makedirs(DIR)


HOME_URL = "http://www.oshpd.ca.gov/commonsurgery/"

SELECT_FIELD_NAMES = {
  'year'=>'ctl00$ContentPlaceHolder1$ddlDischargeYear',
  'category'=>'ctl00$ContentPlaceHolder1$ddlProcedureCategory',
  'procedure'=>'ctl00$ContentPlaceHolder1$ddlProcedure',
  'county'=>'ctl00$ContentPlaceHolder1$ddlCounty'  
}

def form_submit_w_exception_handling(frm)
  retries = 3
  begin
    frm.submit(frm.button_with(:value=>'Go'))
  rescue Exception=>e
    puts "Problem: #{e}"
    if retries < 0 
      retries -= 1
      puts "Sleeping...#{retries} left"
      retry
    else
      raise "Unexpected, repeated errors. Shutting down"
    end
  else
    return frm
  end  
end


mech = Mechanize.new

begin
  mech.get(HOME_URL)
rescue
  raise "Couldn't get homepage"
else
  form = mech.page.form_with(:action=>/default.aspx/) 
end


  
## beginning of loop for Year dropdown
form.field_with(:name=>SELECT_FIELD_NAMES['year']).options[1..-1].each do |yr_opt|
  form[SELECT_FIELD_NAMES['year']] = yr_opt.value
  #form.submit(form.button_with(:value=>'Go'))
  form = form_submit_w_exception_handling(form)
  puts "Year #{yr_opt.value}: #{mech.page.parser.css('tr').length}"

  
  ## beginning of loop for Category dropdown
  form = mech.page.form_with(:action=>/default.aspx/) 
  form.field_with(:name=>SELECT_FIELD_NAMES['category']).options[1..-1].each do |cat_opt|
    form[SELECT_FIELD_NAMES['category']] = cat_opt.value
    form = form_submit_w_exception_handling(form)
    puts "\tCategory #{cat_opt.value}: #{mech.page.parser.css('tr').length}"

    ## beginning of loop for Procedure dropdown
    form = mech.page.form_with(:action=>/default.aspx/) 
    form.field_with(:name=>SELECT_FIELD_NAMES['procedure']).options[1..-1].each do |proc_opt|
        form[SELECT_FIELD_NAMES['procedure']] = proc_opt.value
        form = form_submit_w_exception_handling(form)
        puts "\t\tProcedure #{proc_opt.value}: #{mech.page.parser.css('tr').length}"
        
        ## beginning of loop for County dropdown
        form = mech.page.form_with(:action=>/default.aspx/) 
        form.field_with(:name=>SELECT_FIELD_NAMES['county']).options[1..-1].each do |county_opt|
            form[SELECT_FIELD_NAMES['county']] = county_opt.value
            form = form_submit_w_exception_handling(form)
            puts "\t\tProcedure #{proc_opt.value}, #{county_opt.value}: #{mech.page.parser.css('tr').length}"
            fname = "#{DIR}/#{yr_opt.value}--#{cat_opt.value}--#{proc_opt.value}--#{county_opt.value}.html"
            File.open(fname, 'w'){|f| f.puts mech.page.parser.to_html}
        end
        
    end
  end
end
        
The next step: parsing

This was one of the more annoyingly complicated kinds of public websites to navigate, but hopefully you got a high-level overview of how to figure out a website's eccentricities and workflow. It's possible to combine the scraping and parsing steps into a single script, of course. But for this introductory project, it's more convenient to keep copies of the raw HTML and parse it at our leisure.

If you had trouble downloading the data, I've created a zip file of all the files needed for the next step: parsing the data (i.e. scraping the HTML) and storing into a database.