Parsing HTML with Nokogiri

Parsing with Nokogiri is a breeze, especially if you’re familiar with CSS selectors and comfortable traversing the DOM.

1. First, you need to install the Nokogiri gem:

gem install nokogiri

 

If you are parsing something that lives remotely, you might want to require the
open-uri library, which makes fetching the page even easier.

Depending on your app, remember to add require 'nokogiri' and require 'open-uri' at the top of your file.

2. Now open the page you want to parse:

# URI.open comes from open-uri; a bare open() no longer accepts URLs as of Ruby 3.0
page = Nokogiri::HTML(URI.open("http://www.upworthy.com/"))

 

3. Select the element you want.

There are many ways to get what you want, but the most basic is to pass a CSS selector to the css method, like this:

page.css('div.nugget.clickable.analytic_event')

 

This will give you the whole nugget of info. If you wanted just the links that wrap the images, you’d do:

page.css('div.nugget-image a')

 

And there you go. If you know how to select elements, you can get any text or data you want. Suppose you selected all the links in a page:

links = page.css("a")

What you’ll get is a Nokogiri::XML::NodeSet, which acts like an array of nodes (here, the links). Therefore, you can call methods such as:

links.length # => 6
links[0].text # => Click here
links[0]["href"] # => http://www.google.com

You can even iterate through them, as with this example:

links.each do |link|
  puts link['href'] # print each link's URL
end

Of course there’s a lot more to it than this, but you can do a lot of parsing with just the above and a sense of how elements are nested in the DOM.
