A Programming Primer for Counting and Other Unconventional Tasks

CA Surgeries: Parsing and Storing

Using Nokogiri to parse the collected HTML pages and store the data in a SQLite database

In the first part of this project, we used Mechanize to collect surgery cost data for every California hospital, by county, category of procedure, and type of procedure. Altogether, we collected about 4,500 HTML files (which you can download here if you skipped the last part).

Now we parse the files for the data to store in a database. Since these files are raw HTML, this is essentially a web-scrape: we parse the HTML for the exact fields that we want.

The resulting datafiles, in SQLite and plaintext, are here for download.

The Steps

This phase of the project involves the following steps:

  1. Parse the HTML – Use Nokogiri to read the HTML files and scrape the data from the HTML tables.
  2. Store the data – Save the data into a SQLite database to make them easier to query.

There's no reason that this parsing and storing phase couldn't be combined with the HTML-fetching script from the previous section. But now that all the HTML files are stored locally, we can test out parsing scripts on them and quickly see the results.

Parsing with Nokogiri

By now, examining a webpage's structure with a web inspector such as Firebug should be a familiar process to you.

Open any of the HTML files we pulled from the Common Surgeries database:

A datatable

Inspect the table element:

The table id

It looks like the table has an id of ctl00_ContentPlaceHolder1_gridHospitalList. The first table row element contains headers, the last one contains the number of hospitals in the table. Everything in between is actual data.
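Since the header and footer rows aren't data, we'll slice them off with Ruby's negative range indexing. A quick sketch with a hypothetical list of rows:

```ruby
# A hypothetical table's rows: a header, two data rows, and a footer
rows = ["headers", "data row 1", "data row 2", "hospital count"]

# [1..-2] slices from the second element through the second-to-last,
# dropping the first (header) and last (footer) rows
puts rows[1..-2].inspect
#=> ["data row 1", "data row 2"]
```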

The CSS selector to capture those table row elements is:

#ctl00_ContentPlaceHolder1_gridHospitalList tr

Then it's just a matter of splitting each <tr> element into table cells and converting their contents to text:


page = Nokogiri::HTML(open(fname))
page.css("#ctl00_ContentPlaceHolder1_gridHospitalList tr")[1..-2].each do |tr|
   puts tr.css('td').map{ |td|  
      td.text.gsub(/[$,](?=\d)/, '').gsub(/\302\240|\s/, ' ').strip
   }.join(" | ")
end   
      
#=> ALVARADO HOSPITAL | 6655 ALVARADO ROAD, SAN DIEGO, 92120-5298 | 2 | 3.0 | 45308      
   

Those two gsub calls look more intimidating than they are. The first uses a regular expression to strip dollar signs and commas from currency amounts (e.g. $12,345 becomes 12345). The second converts any kind of HTML whitespace – including non-breaking spaces, tabs, and newline characters – into a normal space character, because we don't want to store line breaks in our flat text file.
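To see the two gsub calls in action on a made-up sample cell (\302\240 is the escaped-byte form of a non-breaking space):

```ruby
# A made-up cell value with a dollar amount, a non-breaking
# space (\302\240), and a trailing newline
raw = "$12,345\302\240per stay\n"

cleaned = raw.gsub(/[$,](?=\d)/, '')     # strip $ and , before digits
             .gsub(/\302\240|\s/, ' ')   # normalize whitespace to spaces
             .strip
puts cleaned
#=> 12345 per stay
```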

Parsing the filenames

If you followed the previous part, the HTML files should be named in this format:

2007--Thyroid Procedures--Parathyroidectomy--Yolo.html   

i.e. YEAR--CATEGORY--PROCEDURE--COUNTY.html

In our database, we want to keep track of these properties. All it takes is splitting the filename by the delimiter "--". We can use the File.basename method to get the filename sans the .html extension and directory name:

year,category,procedure,county = File.basename(fname, '.html').split('--')
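Applied to the sample filename above:

```ruby
fname = "data-hold/pages/2007--Thyroid Procedures--Parathyroidectomy--Yolo.html"
# File.basename drops the directory and, with a second argument,
# the given extension; split('--') then breaks out the four parts
year, category, procedure, county = File.basename(fname, '.html').split('--')
puts [year, category, procedure, county].inspect
#=> ["2007", "Thyroid Procedures", "Parathyroidectomy", "Yolo"]
```

Note that everything, including the year, comes back as a String.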

Looping through

Nothing clever here. Just collect the filenames from whatever directory you stored them in, loop through and parse them. Also, skip the files named "All Counties" so you don't double-count records.

Before we add the data to a database, we can store it all in a flat textfile.

Note: If you get an encoding error, try including #encoding: UTF-8 at the top of the file.


require 'rubygems'
require 'nokogiri'

TABLE_DIV_ID = "#ctl00_ContentPlaceHolder1_gridHospitalList"      
FIELD_NAMES = ['year', 'category', 'procedure', 'county', 'hospital', 'address','discharge_count', 'median_length_in_days', 'median_charge_per_stay']
OFILE = File.open('data-hold/data.txt', 'w')

OFILE.puts FIELD_NAMES.join("\t")

Dir.glob("data-hold/pages/*.html").reject{|f| f =~ /All Counties/}.each do |fname|
   meta_info = File.basename(fname, '.html').split('--')
   page = Nokogiri::HTML(open(fname))
   page.css("#{TABLE_DIV_ID} tr")[1..-2].each do |tr|
      data_tds = tr.css('td').map{ |td|  
         td.text.gsub(/[$,](?=\d)/, '').gsub(/\302\240|\s/, ' ').strip
      }

      OFILE.puts( (meta_info + data_tds).join("\t"))     
   end   
end

OFILE.close

      
   

Storing the Data with SQLite

If you haven't read the chapter on SQL and SQLite, review it now as I won't be explaining the basics here. You should be familiar with the basic syntax, including how to open a database and select rows.

Creating the table

In the previous code section, I defined the FIELD_NAMES constant which contains an array of the field names. It's used to print out the headers for the textfile and we can use it to generate the table schema:


TABLE_NAME = 'surgery_records'
q = "CREATE TABLE #{TABLE_NAME}(#{FIELD_NAMES.map{|f| "`#{f}`"}.join(', ')});"
puts q

That map call simply wraps each field name in backticks (which is necessary if a field name, such as "A Hospital's Name", contains non-alphanumeric characters):

CREATE TABLE surgery_records(`year`, `category`, `procedure`, `county`, `hospital`, `address`, `discharge_count`, `median_length_in_days`, `median_charge_per_stay`)

Specify column types

As with other variants of SQL, SQLite allows you to specify a datatype per column. It is helpful to declare that a column is supposed to contain numbers so that its values are compared numerically rather than as strings. Remember:

Numbers:  11 > 9
Strings:  "11" < "9"
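The same comparison pitfall can be demonstrated in Ruby itself:

```ruby
# Numbers compare by value
puts 11 > 9        #=> true
# Strings compare character by character, and "1" sorts before "9"
puts "11" < "9"    #=> true
```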

Turning FIELD_NAMES into a two-dimensional array allows us to modify the query string used to create the table in the following fashion:


FIELD_NAMES = [['year', 'NUMERIC'],['category', 'VARCHAR'], ['procedure', 'VARCHAR'],['county', 'VARCHAR'], ['hospital', 'VARCHAR'],  ['address', 'VARCHAR'], ['discharge_count', 'NUMERIC'], ['median_length_in_days', 'NUMERIC'], ['median_charge_per_stay', 'NUMERIC'] ]

q=  "CREATE TABLE #{TABLE_NAME}(#{FIELD_NAMES.map{|f| "`#{f[0]}` #{f[1]}"}.join(', ')});"

puts q
#=> CREATE TABLE surgery_records(`year` NUMERIC, `category` VARCHAR, `procedure` VARCHAR, `county` VARCHAR, `hospital` VARCHAR, `address` VARCHAR, `discharge_count` NUMERIC, `median_length_in_days` NUMERIC, `median_charge_per_stay` NUMERIC);
      
Create column indexes

For large datasets, you can greatly speed up your queries by indexing the columns you intend to use in WHERE conditions, such as WHERE column_a = 'dog'. For the Common Surgeries data, we will be aggregating numbers per procedure, county, and hospital, so it makes sense to index those columns.

We will also SELECT rows based on specific year values. But since there are only three possible values – 2007, 2008, and 2009 – in this dataset, the performance gain from indexing that column would be minimal, if any.

Read more about indexing database tables here.

In the FIELD_NAMES array, we can add a third value per field to serve as the name of that field's index. If that third value is present, we create an index for the field; if not, we don't. The actual index name isn't important, nor will you need to remember it while doing queries. The SQLite engine will use the index automatically:


FIELD_NAMES = [['year', 'NUMERIC'],['category', 'VARCHAR'], ['procedure', 'VARCHAR', 'procedure_index' ],['county', 'VARCHAR', 'county_index'], ['hospital', 'VARCHAR', 'hospital_index'],  ['address', 'VARCHAR'], ['discharge_count', 'NUMERIC'], ['median_length_in_days', 'NUMERIC'], ['median_charge_per_stay', 'NUMERIC'] ]

FIELD_NAMES.each do |fn|
   puts "CREATE INDEX #{fn[2]} ON #{TABLE_NAME}(#{fn[0]})" unless fn[2].nil?
end         

#=> CREATE INDEX procedure_index ON surgery_records(procedure)
#=> CREATE INDEX county_index ON surgery_records(county)
#=> CREATE INDEX hospital_index ON surgery_records(hospital)
      
The INSERT statement

This is the INSERT statement for loading our data:


DB_INSERT_STATEMENT = "INSERT into #{TABLE_NAME} values (#{FIELD_NAMES.map{'?'}.join(',')})"
puts DB_INSERT_STATEMENT   #=>   "INSERT into surgery_records values (?,?,?,?,?,?,?,?,?)"         
      

Those question marks will be filled in with the values we specify once we actually execute the INSERT statement. How does SQLite know which value goes into which column? Since we don't specify anything, it simply inserts the values in sequential order, i.e. the value in the first ? will be inserted in the first column of the table.

As long as we specify the data in the same order of the table's column structure, we will be good to go. Luckily, we prepared in advance by creating the table columns with the FIELD_NAMES array. We use that array again when entering the actual data.
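To sanity-check that alignment, we can zip the column names against a hypothetical row (values borrowed from the earlier examples, purely for illustration):

```ruby
FIELD_NAMES = [['year', 'NUMERIC'], ['category', 'VARCHAR'],
   ['procedure', 'VARCHAR', 'procedure_index'], ['county', 'VARCHAR', 'county_index'],
   ['hospital', 'VARCHAR', 'hospital_index'], ['address', 'VARCHAR'],
   ['discharge_count', 'NUMERIC'], ['median_length_in_days', 'NUMERIC'],
   ['median_charge_per_stay', 'NUMERIC']]

# A made-up row, in the same order as the table's columns
row = ['2007', 'Thyroid Procedures', 'Parathyroidectomy', 'Yolo',
       'ALVARADO HOSPITAL', '6655 ALVARADO ROAD, SAN DIEGO, 92120-5298',
       '2', '3.0', '45308']

# Pair each column name with its value to eyeball the alignment
FIELD_NAMES.map{ |f| f[0] }.zip(row).each do |col, val|
   puts "#{col}: #{val}"
end
```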

Providing the values for the INSERT statement is as simple as passing in an array of values to the SQLite instance's execute method:


DB.execute(DB_INSERT_STATEMENT, some_array_of_values)
   

Storage code, all together

I've made a few iterations on the code and its variables as we progressed in this chapter. Here's the final code; on my laptop, it completed the scraping and database loading in under a minute:

require 'rubygems'
require 'nokogiri'
require 'sqlite3'

FIELD_NAMES = [['year', 'NUMERIC'],['category', 'VARCHAR'], ['procedure', 'VARCHAR', 'procedure_index' ],['county', 'VARCHAR', 'county_index'], ['hospital', 'VARCHAR', 'hospital_index'],  ['address', 'VARCHAR'], ['discharge_count', 'NUMERIC'], ['median_length_in_days', 'NUMERIC'], ['median_charge_per_stay', 'NUMERIC'] ]

# Set up scrape and text file
TABLE_DIV_ID = "#ctl00_ContentPlaceHolder1_gridHospitalList"      
OFILE = File.open('data-hold/ca-common-surgeries.txt', 'w')
OFILE.puts( FIELD_NAMES.map{|f| f[0]}.join("\t") )

# Set up database; delete existing sqlite file
DBNAME = "data-hold/ca-common-surgeries.sqlite"
File.delete(DBNAME) if File.exists?(DBNAME)
DB = SQLite3::Database.new( DBNAME )

TABLE_NAME = "surgery_records"
DB_INSERT_STATEMENT = "INSERT into #{TABLE_NAME} values
  (#{FIELD_NAMES.map{'?'}.join(',')})"

# Create table
DB.execute "CREATE TABLE #{TABLE_NAME}(#{FIELD_NAMES.map{|f| "`#{f[0]}` #{f[1]}"}.join(', ')});"
FIELD_NAMES.each do |fn| 
  DB.execute "CREATE INDEX #{fn[2]} ON #{TABLE_NAME}(#{fn[0]})" unless fn[2].nil?
end


Dir.glob("data-hold/pages/*.html").reject{|f| f =~ /All Counties/}.each do |fname|
   meta_info = File.basename(fname, '.html').split('--')
   page = Nokogiri::HTML(open(fname))
   
   page.css("#{TABLE_DIV_ID} tr")[1..-2].each do |tr|
      data_tds = tr.css('td').map{ |td|  
         td.text.gsub(/[$,](?=\d)/, '').gsub(/\302\240|\s/, ' ').strip
      }
    data_row = meta_info + data_tds
      OFILE.puts( data_row.join("\t"))  
      DB.execute(DB_INSERT_STATEMENT, data_row)

   end   
end

OFILE.close


Code Comparison: SQL vs. Ruby

Was it worth all that work to create a SQLite database for records that are easily stored in a textfile? For this relatively small set (20,000+ rows), it may come down to a matter of preference. But for datasets with rows numbering in the hundreds of thousands or even millions, a database is probably the only way to efficiently handle the data.

Learning SQL – especially if you're just learning programming, period – is intimidating. But knowing how to construct even simple statements can save you time and frustration later on. SQL isn't the only database solution, but the basics of how database software meshes with scripting code is the same across SQL and non-SQL solutions.

The following section contains a few sample queries on the SQLite database, each paired with the pure Ruby code needed to get the same results from a textfile loaded as an array of tab-delimited rows.

Each task below shows the SQL database version first, then the Ruby collection version.

Setup

SQL database:

DBNAME = "data-hold/ca-common-surgeries.sqlite"
DB = SQLite3::Database.open( DBNAME )
TABLE_NAME = 'surgery_records'

Ruby collection:

TEXT_FILE = File.open(
   'data-hold/ca-common-surgeries.txt', 'r'){ |f| 
   f.readlines }.map{ |line| line.split("\t") 
}[1..-1]
Count number of rows

SQL database:

rowcount = DB.execute("SELECT COUNT(1) FROM
 #{TABLE_NAME}")[0]

Ruby collection:

rowcount = TEXT_FILE.length
Total number of discharges

SQL database:

DB.execute("SELECT SUM(discharge_count) 
   FROM #{TABLE_NAME}")[0]

Ruby collection:

TEXT_FILE.inject(0){ |sum, line| 
   sum += line[6].to_i }
Total number of records by procedure

SQL database:

DB.execute("SELECT procedure, 
   COUNT(1) FROM #{TABLE_NAME} 
   GROUP BY procedure")

This returns an array of arrays:

[['PROCEDURE_NAME', 'PROCEDURE_COUNT']]

Ruby collection:

TEXT_FILE.inject({}) do |hash, line| 
   hash[line[2]] ||= 0
   hash[line[2]] += 1
   hash
end

This creates a hash with the procedure names as the keys:

hash['PROCEDURE_NAME'] = PROCEDURE_COUNT
Select and order rows, with conditions

The most common queries will involve gathering records that meet certain conditions, such as having a county of "Los Angeles" and a year of "2008". And in some cases, you'll want them ordered by a certain value.

SQL database:

DB.execute("SELECT * FROM #{TABLE_NAME} 
           WHERE county='Los Angeles' 
               AND year=2008
            ORDER by median_charge_per_stay DESC")

Ruby collection:

TEXT_FILE.select{ |row| 
      row[0]=='2008' && row[3]=='Los Angeles' 
      }.sort{ |x,y| 
         y[8].to_i <=> x[8].to_i
      }
Considering SQL versus Ruby collections

To a SQL novice, the SQLite query reads like plain (albeit somewhat broken) English, and is easy to comprehend even if you know nothing about this particular dataset. In the text-array version, you have to decipher several boolean expressions and a chain of methods.

So is the extra work of learning SQL and creating a database worth it? As with most things in programming, it's a pain at first but saves you much greater pain later on. Besides devoting a chapter to SQL, I use it for most of the data-intensive projects included in this book.

If you had trouble creating the database from the last chapter, you can download a copy of the data I compiled here (includes both SQLite and plaintext).