Scraping an SPA
I recently needed to scrape some data from https://mapping.ncua.gov/ResearchCreditUnion. I hadn't done any web scraping in years, so I made this guide to document the process.
First Steps
The data I need is for credit unions that meet certain criteria. I can use the site's search filters to yield only the results I want: on the order of 100 out of more than 20,000. The search page is an SPA ("search" just changes what's rendered instead of visiting a new page), but each result links to a new page.
I notice some custom HTML attributes[1] along the lines of _ngcontent-fsp-c114 throughout the page. A web search reveals them to be generated by Angular. So how do you scrape an Angular SPA?
Scraping with Puppeteer
Some research turns up a few tools for scraping dynamic websites. Puppeteer seems to be the most widely used, so I'll pick a nice tutorial and get started. Before running anything I'll check for a robots.txt. There isn't one, so here we go!
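You can also do this check programmatically. A minimal sketch, run from a Node REPL so top-level await works, and assuming Node 18+ where fetch is global (note that some servers answer unknown paths with the SPA's index page rather than a 404):

// Request robots.txt directly; a 404 means there's nothing to honor.
const res = await fetch('https://mapping.ncua.gov/robots.txt');
console.log(res.status === 404 ? 'no robots.txt' : await res.text());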
First I make a project directory and install puppeteer:
mkdir scrape-spa
cd scrape-spa
npm i puppeteer --save
I try and run some code from the tutorial that visits a page and generates a screenshot. After a small change to generate a larger screenshot, I have:
const puppeteer = require('puppeteer');

const url = process.argv[2];
if (!url) {
    throw new Error("Please provide URL as a first argument");
}

async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.setViewport({
        width: 1080,
        height: 2160,
        deviceScaleFactor: 1,
    });
    await page.screenshot({path: 'screenshot.png'});
    await browser.close();
}

run();
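Assuming I've saved this as screenshot.js (my name for it), it runs with the target URL as its argument:

node screenshot.js https://mapping.ncua.gov/ResearchCreditUnion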
It works! Now for interacting with the page. Since I'm working in the dark, I'll issue a click to the page and then take a screenshot to see the result. By looking at the page source, I identify an attribute of the HTML element I want to click:
<mat-select _ngcontent-jcn-c114="" role="combobox" aria-autocomplete="none" aria-haspopup="true" name="cuStatus" formcontrolname="cuStatus" class="mat-select ng-tns-c47-9 ng-tns-c30-8 mat-select-empty ng-untouched ng-pristine ng-valid ng-star-inserted" aria-labelledby="mat-form-field-label-11 mat-select-value-3" id="mat-select-2" tabindex="0" aria-expanded="false" aria-required="false" aria-disabled="false" aria-invalid="false">...</mat-select>
I can then programmatically find this element with puppeteer and issue a click[2]:
const dropdown = await page.$('[name="cuStatus"]');
await dropdown.click();
So far so good. Manually clicking this dropdown adds some new elements to the DOM. I've got to click one of these to make a selection:
const [selection] = await page.$x("//span[contains(., 'Active')]");
if (selection) {
    await selection.click();
}
The screenshot alone doesn't make it clear that this worked, but adding a wait interval between the click and the screenshot shows that it did:
await page.waitForTimeout(1000);
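A fixed sleep either wastes time or loses the race on a slow page. If the options that appear are Angular Material mat-option elements (an assumption on my part, though consistent with the mat-select markup above), it's sturdier to wait for them directly:

// Wait for the dropdown's options to render instead of sleeping for a fixed interval.
await page.waitForSelector('mat-option');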
We can manipulate the other fields similarly until we are ready to perform the search. Then we find the button and click:
const [button] = await page.$x("//button[@title=\"find more details\"]/span[contains(., 'FIND')]");
if (button) {
    await button.click();
}
Executing the search populates the page with the first 20 search results, and displays buttons for navigating to the rest. Each result contains a button that links to the corresponding page, where the data I need is located. So we need to:
- scrape the hyperlink value off of each result
- click to continue to the next page (if necessary) and repeat
Scraping the hyperlink can be done like this[3]:
let links = [];
const buttons = await page.$x("//a/span[contains(., 'VIEW')]/..");
for (let button of buttons) {
    const attr = await page.evaluate(el => el.getAttribute("href"), button);
    links.push("https://mapping.ncua.gov".concat(attr)); // not sure this is nec.
}
Unfortunately this approach results in exactly two copies of each link, because the table of search results exists in the DOM twice:
<div _ngcontent-fsp-c114="" tabindex="0" class="tb-desktop-container">...</div>
<div _ngcontent-fsp-c114="" tabindex="0" class="tb-mobile-container">...</div>
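One alternative would be to scope the XPath to the desktop copy using the tb-desktop-container class above, so each link only matches once; a sketch (it assumes the links sit inside that container, which is what the duplicated tables suggest):

// Match VIEW links only inside the desktop table, skipping the mobile duplicate.
const buttons = await page.$x("//div[@class='tb-desktop-container']//a/span[contains(., 'VIEW')]/..");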
Rather than rework the XPath, I'll fix it (and handle the pagination) with the following spaghetti code[4]:
let links = [];
let next = null;
do {
    let newLinks = [];
    // The "Next page" button is disabled on the last page of results.
    const [nextButton] = await page.$x("//button[@aria-label=\"Next page\" and not(@disabled)]");
    next = nextButton;
    const buttons = await page.$x("//a/span[contains(., 'VIEW')]/..");
    for (let b of buttons) {
        const attr = await page.evaluate(el => el.getAttribute("href"), b);
        newLinks.push("https://mapping.ncua.gov".concat(attr));
    }
    // Each link appears twice (desktop table first, then mobile); keep the first half.
    newLinks = newLinks.slice(0, newLinks.length / 2);
    links = links.concat(newLinks);
    if (next) {
        await next.click();
        await page.waitForTimeout(1000);
    }
} while (next);
I've conquered the SPA! Now it's time to visit all the scraped links and get the data I need.
It's probably a reasonable assumption that each page renders the same fields, but I'll err on the side of collecting more data and verify that assumption later:
var dict = {};
for (let l of links) {
    await page.goto(l);
    await page.waitForTimeout(2000);
    // Field names are in the "dvHeader" cells of the details table...
    const fieldElements = await page.$x("//table[@class=\"table-details\"]/tbody/tr/td[@class=\"dvHeader\"]");
    let fields = [];
    for (let e of fieldElements) {
        const field = await page.evaluate(el => el.textContent, e);
        fields.push(field);
    }
    // ...and the corresponding values are in the unclassed cells.
    const valueElements = await page.$x("//table[@class=\"table-details\"]/tbody/tr/td[not(@class)]");
    let vals = [];
    for (let e of valueElements) {
        const val = await page.evaluate(el => el.textContent, e);
        vals.push(val);
    }
    dict[l] = {
        Keys: fields,
        Vals: vals
    };
}
Again, there's probably a lot of duplication here (the field names should be the same for each page). But now I should have everything I need. I'll just write it to a file[5] and then have a closer look in a Node REPL:
var fsp = require('fs/promises');
await fsp.writeFile("data.json",JSON.stringify(dict));
Cleaning and Exporting
I need to make sure I've got the same fields for each page, and then I've got to export the data to Excel somehow: CSV seems like a good option.
I'll get a Node REPL open and import the data[6]:
const fs = require('fs');
let rawdata = fs.readFileSync('data.json');
let data = JSON.parse(rawdata);
Now I can easily confirm that the keys are the same[7] for each page in the dictionary:
let model = data[Object.keys(data)[0]].Keys;
for (let key of Object.keys(data)) {
    let curr = data[key].Keys;
    // Flag any page whose field list differs from the first page's.
    if (!(model.length === curr.length && model.every((value, index) => value === curr[index]))) {
        console.log("uh oh");
    }
}
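An equivalent, terser check (a sketch; like the loop above, it assumes key order is stable across pages) compares JSON serializations:

// Serialize the first page's key list and compare every other page against it.
const modelJson = JSON.stringify(data[Object.keys(data)[0]].Keys);
for (let key of Object.keys(data)) {
    if (JSON.stringify(data[key].Keys) !== modelJson) console.log("uh oh:", key);
}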
No output! They're all the same (phew). We can easily construct the CSV now, starting with the header:
let fields = [];
for (let field of data[Object.keys(data)[0]].Keys) {
    fields.push("\"".concat(field.slice(0, -1), "\""));
}
let header = fields.join(",");
Slicing removes the trailing colon from each field name, and the enclosing quotation marks keep commas in field names from poisoning the CSV.
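One caveat: quoting alone breaks if a field or value ever contains a double quote of its own; standard CSV (RFC 4180) escapes these by doubling them. A small helper to that effect (hypothetical, not part of the script above):

// Quote a CSV field, doubling any embedded double quotes per RFC 4180.
function csvQuote(value) {
    return "\"".concat(String(value).replace(/"/g, "\"\""), "\"");
}

Now for the data: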
let rows = [];
for (let key of Object.keys(data)) {
    let vals = [];
    for (let v of data[key].Vals) {
        vals.push("\"".concat(v.trim(), "\""));
    }
    rows.push(vals.join(","));
}
let csv = header.concat("\n", rows.join("\n"));
All that's left is to export the file:
var fsp = require('fs/promises');
await fsp.writeFile("data.csv", csv);
Done!