Scraping the Web with Node.js

Feb 21, 2019

Sometime last week, a colleague was trying to grab some data off a website to use in an app he was working on. It got me interested in the idea of web scraping as it’s not something I have done before. Here are my thoughts on the subject and my simple implementation.



The website I scraped was last.fm. They actually have a developer API for performing some useful actions like getting a user's library, matching similar artists, etc., but what I am interested in is a list of upcoming albums. They have a page that's updated daily with upcoming albums from various artists, and this is the data I need. The idea is that you can bookmark albums you are interested in and get notified on the release date, or search for artists and receive a notification when an album from them becomes available.


Setup

I'm using Node.js for the backend, so to follow along, install Node.js if you don't have it already. The source code is on GitHub if you want to refer to it. Run the following commands to create a new project directory and install the packages we will be using.


mkdir rest-api-tutorial
cd rest-api-tutorial
npm init
npm install --save nodemon cors express request body-parser cheerio
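
One thing to note: npm start (which we will use shortly) needs a start script in package.json. If npm init didn't add one for you, something like this in the scripts section will do (I'm pointing it at the app.js file we create next):

"scripts": {
  "start": "nodemon app.js"
}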


Create a Simple Server

Let’s set up a server to start receiving requests. Create a new file in the root of your project called app.js and add the code snippets below:

const express = require('express');
const bodyParser = require('body-parser');
const request = require('request');
const cheerio = require('cheerio');
const cors = require('cors');
const app = express();
const PORT = process.env.PORT || 3000;

app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json());
app.use(cors());
app.get('/last-fm', (req, res) => {
  ...
});
app.listen(PORT, (err) => {
  if (err) { return console.log('something bad happened', err); }
  console.log(`server is listening on ${PORT}`);
});

We import the packages installed in the previous step, create an Express app, and set some configuration on it using app.use(). The server is going to accept requests from different origins, hence the cors package. We create a route ('/last-fm'), which doesn't do anything at the moment but will hold the logic for our scraper. Then we listen on the port specified earlier.


Switch to the terminal and run npm start to start the server, then navigate to http://localhost:3000/last-fm in the browser. Nothing happens yet, so let's add some logic to actually do something useful.


Scraping the website

Let's think about what we need to do for a second. We are going to make a request to the webpage to get its contents, then filter that for the information we actually need. What do we need? The album name, cover art, artist, and release date. Here is where my laziness plays a crucial role.


The thought of writing the code to filter the text out of the webpage was a big pain. After doing some research (high-quality googling and combing through Stack Overflow), I found a package called cheerio, which is basically a server-side version of jQuery. I don't know if it does everything jQuery can, but it works the same way and gives us what we need to build this scraper.
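
To get a quick feel for how it works, cheerio loads an HTML string and lets you query it with jQuery-style selectors. Here is a tiny standalone example (not part of our scraper):

const cheerio = require('cheerio');
const $ = cheerio.load('<h2 class="title">Hello world</h2>');
// Select by tag and class, just like jQuery
console.log($('h2.title').text()); // prints: Hello world

Now, add this to the /last-fm route: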

const URL = 'https://www.last.fm/music/+releases/coming-soon/popular';
const titles = [];
const coverArts = [];
const artists = [];
const releaseDates = [];
const comingSoon = [];

We declare the variables that will hold the information we scrape from the page. URL is the webpage we want to scrape. Let's make a request to the server hosting the page.

request(URL, (error, response, html) => {
  ...
});

request() takes two arguments: a URL and a callback function. The callback receives three arguments, which are pretty self-explanatory. The html parameter contains the HTML of the page, while the response object carries other properties like the status code.
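
If you want to sanity-check the request before writing any parsing logic, you can log the status code and a slice of the markup. This is a throwaway snippet, not part of the route:

request(URL, (error, response, html) => {
  if (error) return console.error('request failed:', error);
  // 200 means the page was fetched successfully
  console.log(response.statusCode);
  // Peek at the first few hundred characters of the markup
  console.log(html.slice(0, 300));
});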


Let’s pause at this point and look at the structure of the html we are interested in.

<ol>
  <li>
    <div>
      <h3>
        <a class="link-block-target"></a>
      </h3>
      <div>
        <span>
          <img class="cover-art">
        </span>
      </div>
      <p class="album-grid-item-artist"></p>
      <p class="album-grid-item-date"></p>
    </div>
  </li>
</ol>

This is the stripped-down structure of the section that contains the details we need. Note the classes we will use to select each element with jQuery. Inside the request callback function, add this.

if (error) return res.json({ message: 'An error occurred when fetching the page' });

const $ = cheerio.load(html);
$('ol').first().find('a.link-block-target').filter(function () {
  const data = $(this).text();
  titles.push(data);
});

The first line checks whether there was an error and returns an appropriate response early. If the request was successful, we use cheerio (our server-side jQuery) to load the page. There is more than one ol tag on the page, but it's the first one we need. The title is contained in an a tag with the class link-block-target. We find all the elements with that class and append their text content to the titles array. If you know jQuery, this will look very familiar. We basically repeat the same thing for the remaining details.

$('ol').first().find('img.cover-art').filter(function () {
  const data = $(this).attr('src');
  coverArts.push(data);
});
$('ol').first().find('p.album-grid-item-artist').filter(function () {
  const data = $(this).text();
  const filteredData = data.replace(/\n/g, '').trim();
  artists.push(filteredData);
});
$('ol').first().find('p.album-grid-item-date').filter(function () {
  const data = $(this).text();
  const filteredData = data.replace(/\n/g, '').trim();
  releaseDates.push(filteredData);
});

Unlike the title, artist name, and release date, the album art lives in the src attribute of the img tag, so I select the element and read the attribute with the jQuery attr() method.

for (let i = 0; i < titles.length; i++) {
  comingSoon.push({
    title: titles[i],
    coverArt: coverArts[i],
    artist: artists[i],
    releaseDate: releaseDates[i],
  });
}
res.json(comingSoon);

I loop through the arrays containing the different pieces of data and create an array of objects, then send it as JSON to the client. Save the file, switch to the browser, and navigate to the /last-fm route; you should get a JSON response shaped like the sample below.
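
The exact entries depend on what last.fm lists on the page that day, but each object in the array has this shape (the values here are made up for illustration):

[
  {
    "title": "Some Album",
    "coverArt": "https://.../some-album-cover.jpg",
    "artist": "Some Artist",
    "releaseDate": "1 Mar 2019"
  },
  ...
]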

I hosted the backend on Heroku and built a small front end with Preact to consume the endpoints. The landing page is a list of upcoming albums and some details; you can click on an album to view the full tracklist. I added an extra route for the detail view, which I did not explain here, but the full source is on GitHub.

Legal notes

The legality of web scraping is a bit of a grey area, and you should do some research if you are planning to do something bigger or commercial with the data you scrape. This is just a small demo, so it should be okay.


The source code is on GitHub if you would like to check it out. Leave a star if you found it useful, and share it. That would be nice.


Thanks for reading.