Scraping an SPA

I recently needed to scrape some data from https://mapping.ncua.gov/ResearchCreditUnion. I hadn't done any web scraping in years, so I made this guide to document the process.

First Steps

The data I need is for credit unions that meet certain criteria. I can use the site's search filters to narrow the results to the ones I want: on the order of 100 out of more than 20,000. The search page is an SPA ("search" just changes what's rendered instead of visiting a new page), but each result links to a new page.

I notice some custom HTML attributes along the lines of _ngcontent-fsp-c114 throughout the page. A web search reveals them to be generated by Angular. So how do you scrape an Angular SPA?

Scraping with Puppeteer

Some research turns up a few tools for scraping a dynamic website:

Puppeteer seems to be the most widely used, so I'll pick a nice tutorial and get started. Before running anything I'll check for a robots.txt. There isn't one, so here we go!

First I make a project directory and install puppeteer:

mkdir scrape-spa
cd scrape-spa
npm i puppeteer --save

I try running some code from the tutorial that visits a page and generates a screenshot. After a small change to generate a larger screenshot, I have:

const puppeteer = require('puppeteer');
const url = process.argv[2];
if (!url) {
    throw new Error("Please provide a URL as the first argument");
}
async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.setViewport({
        width: 1080,
        height: 2160,
        deviceScaleFactor: 1,
    });
    await page.screenshot({path: 'screenshot.png'});
    await browser.close();
}
run();
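
One thing the script above doesn't handle is a failure partway through: if page.goto throws, the browser is never closed. Here is a minimal sketch of a wrapper that always closes it. The name withBrowser and its launch and task parameters are my own invention for illustration, not part of Puppeteer:

```javascript
// Hypothetical helper: `launch` is any function returning a browser-like
// object with an async close() method, e.g. () => puppeteer.launch().
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        // Hand the browser to the caller and return whatever it produces.
        return await task(browser);
    } finally {
        // Runs on success and on error, so Chromium is not left running.
        await browser.close();
    }
}
```

With Puppeteer this would presumably be called as withBrowser(() => puppeteer.launch(), async (browser) => { ... }), with the body of run() moved into the second function.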

It works! Now for interacting with the page. Since I'm working in the dark, I'll try to issue a click to the page and then take a screenshot to display the result. By looking at the page source, I identify an attribute of the HTML element I want to click:

<mat-select _ngcontent-jcn-c114="" role="combobox" aria-autocomplete="none" aria-haspopup="true" name="cuStatus" formcontrolname="cuStatus" class="mat-select ng-tns-c47-9 ng-tns-c30-8 mat-select-empty ng-untouched ng-pristine ng-valid ng-star-inserted" aria-labelledby="mat-form-field-label-11 mat-select-value-3" id="mat-select-2" tabindex="0" aria-expanded="false" aria-required="false" aria-disabled="false" aria-invalid="false">...</mat-select>

I can then programmatically find this element with puppeteer and issue a click:

const dropdown = await page.$('[name="cuStatus"]');
await dropdown.click();

So far so good. Manually clicking this dropdown adds some new elements to the DOM. I've got to click one of these to make a selection:

const [selection] = await page.$x("//span[contains(., 'Active')]");
if (selection) {
    await selection.click();
}

The screenshot doesn't clearly show that this worked, but it does once I add a wait between the click and the screenshot:

await page.waitForTimeout(1000);
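
A fixed one-second pause is a guess: too short on a slow day, wasted time on a fast one. As far as I know, newer Puppeteer releases have also dropped page.waitForTimeout. A rough alternative in plain JavaScript is to poll for a condition. The helper names sleep and waitUntil below, and their default timings, are my own assumptions rather than anything from Puppeteer:

```javascript
// Plain-JavaScript pause, usable where page.waitForTimeout is unavailable.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: poll `check` until it returns a truthy value or
// `timeout` milliseconds pass. `check` may be sync or async.
async function waitUntil(check, { timeout = 5000, interval = 100 } = {}) {
    const deadline = Date.now() + timeout;
    while (true) {
        const result = await check();
        if (result) return result;
        if (Date.now() >= deadline) {
            throw new Error(`condition not met within ${timeout} ms`);
        }
        await sleep(interval);
    }
}
```

Since page.$ resolves to null when nothing matches, something like await waitUntil(() => page.$('[name="cuStatus"]')) should wait only as long as needed. Puppeteer's own page.waitForSelector covers the same case for CSS selectors.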

We can manipulate the other fields similarly until we are ready to perform the search. Then we find the button and click:

const [button] = await page.$x("//button[@title=\"find more details\"]/span[contains(., 'FIND')]")
if (button) {
    await button.click();
}

Executing the search populates the page with the first 20 search results, and displays buttons for navigating to the rest. Each result contains a button that links to the corresponding page, where the data I need is located. So we need to:

  • scrape the hyperlink value off of each result
  • click to continue to the next page (if necessary) and repeat

Scraping the hyperlink can be done like this:

let links = []
const buttons = await page.$x("//a/span[contains(., 'VIEW')]/..");
for( let button of buttons ) {
    const attr = await page.evaluate(el => el.getAttribute("href"), button);
    links.push("https://mapping.ncua.gov".concat(attr)) // not sure this is nec.
}

Unfortunately this approach results in exactly two copies of each link, because the table of search results exists in the DOM twice:

<div _ngcontent-fsp-c114="" tabindex="0" class="tb-desktop-container">...</div>
<div _ngcontent-fsp-c114="" tabindex="0" class="tb-mobile-container">...</div>

I can fix this with the following spaghetti code:

let links = [];
let next = null;
do {
    let newLinks = []
    const [nextButton] = await page.$x("//button[@aria-label=\"Next page\" and not(@disabled)]");
    next = nextButton;
    const buttons = await page.$x("//a/span[contains(., 'VIEW')]/..");
    for( let b of buttons ) {
        const attr = await page.evaluate(el => el.getAttribute("href"), b);
        newLinks.push("https://mapping.ncua.gov".concat(attr))
    }
    newLinks = newLinks.slice(0,newLinks.length/2)
    links = links.concat(newLinks)
    if (next) {
        await next.click();
        await page.waitForTimeout(1000);
    }
} while (next);
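
Slicing the list in half works only as long as the desktop and mobile tables hold the same links in the same order. A less fragile sketch is to deduplicate instead. The function name uniqueLinks is hypothetical, and the sample paths in the usage note are made up. Resolving with the URL class also settles my earlier doubt about prepending the origin, since it handles relative and absolute hrefs alike:

```javascript
// Assumption: BASE is the site origin used in the loop above.
const BASE = "https://mapping.ncua.gov";

// Resolve each href against BASE and drop repeats, keeping first-seen order.
function uniqueLinks(hrefs, base = BASE) {
    const seen = new Set();
    for (const href of hrefs) {
        if (!href) continue; // skip anchors with no href attribute
        seen.add(new URL(href, base).href);
    }
    return [...seen];
}
```

For example, uniqueLinks(["/a/1", "/a/2", "/a/1", "/a/2"]) returns the two absolute URLs once each, whatever order the duplicates arrive in.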

I've conquered the SPA! Now it's time to visit all the scraped links and get the data I need.

It's probably a reasonable assumption that each page renders the same fields, but I'll err on the side of collecting more data and verify that assumption later:

var dict = {}
for( let l of links ) {
    await page.goto(l);
    await page.waitForTimeout(2000);
    const fieldElements = await page.$x("//table[@class=\"table-details\"]/tbody/tr/td[@class=\"dvHeader\"]");
    let fields = []
    for( let e of fieldElements ) {
        const field = await page.evaluate(el => el.textContent, e);
        fields.push(field)
    }
    const valueElements = await page.$x("//table[@class=\"table-details\"]/tbody/tr/td[not(@class)]");
    let vals = []
    for( let e of valueElements ) {
        const val = await page.evaluate(el => el.textContent, e);
        vals.push(val)
    }
    dict[l] = {
        Keys: fields,
        Vals: vals
    }
}
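
Storing the labels and values as two parallel arrays keeps everything, but it is awkward to query. If each page really does yield one value cell per header cell, in the same order (an assumption I haven't verified yet), the two arrays could be zipped into a single object per page. The helper name toRecord and the sample labels in the usage note are hypothetical:

```javascript
// Hypothetical helper: pair the i-th label with the i-th value.
// Assumes the page yields one value cell per header cell, in the same order.
function toRecord(keys, vals) {
    const record = {};
    keys.forEach((key, i) => {
        const name = key.trim().replace(/:$/, ""); // drop a trailing colon
        record[name] = (vals[i] ?? "").trim();     // missing value -> ""
    });
    return record;
}
```

So toRecord(["Charter Number:", "City:"], [" 123 ", "Springfield"]) gives { "Charter Number": "123", City: "Springfield" }.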

Again, there's probably a lot of duplication here (the field names should be the same for each page). But now I should have everything I need. I'll just write it to a file and then have a closer look in a Node REPL:

var fsp = require('fs/promises');
await fsp.writeFile("data.json",JSON.stringify(dict));

Cleaning and Exporting

I need to make sure I've got the same fields for each page, and then I've got to export the data to Excel somehow: CSV seems like a good option.

I'll open a Node REPL and import the data:

const fs = require('fs');
let rawdata = fs.readFileSync('data.json');
let data = JSON.parse(rawdata);

Now I can easily confirm that the keys are the same for each page in the dictionary:

let model = data[Object.keys(data)[0]].Keys
for ( let key of Object.keys(data) ) {
    let curr = data[key].Keys;
    if (!(model.length === curr.length && model.every(function(value, index) { return value === curr[index]}) )) {
        console.log("uh oh");
    }
}

No output! They're all the same (phew). We can easily construct the CSV now, starting with the header:

let fields = [];
for ( let field of data[Object.keys(data)[0]].Keys ) {
    fields.push("\"".concat(field.slice(0,-1),"\""))
}
let header = fields.join(",");

Slicing removes the trailing colon from each field name, and the enclosing quotation marks keep commas in field names from poisoning the CSV. Now for the data:

let rows = []
for ( let key of Object.keys(data) ) {
    let vals = []
    for ( let v of data[key].Vals ) {
        vals.push("\"".concat(v.trim(),"\""));
    }
    // join the quoted values into one CSV row and collect it
    rows.push(vals.join(","))
}
let csv = header.concat("\n",rows.join("\n"))
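
Wrapping values in quotation marks protects against commas, but a value that itself contains a quotation mark would still break the row. The usual CSV convention (RFC 4180) is to double any embedded quote. Here is a minimal sketch of that. The function names csvField and toCsv are my own, and I haven't checked whether the scraped data actually contains any quotes:

```javascript
// Quote one CSV field, doubling embedded quotes as RFC 4180 describes.
function csvField(value) {
    return '"' + String(value).trim().replace(/"/g, '""') + '"';
}

// Build the whole CSV from a header array and an array of row arrays.
function toCsv(header, rows) {
    return [header, ...rows]
        .map((cells) => cells.map(csvField).join(","))
        .join("\n");
}
```

Here header would be the array of cleaned field names and rows one array of values per page, both unquoted, since csvField does the quoting.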

All that's left is to export the file:

var fsp = require('fs/promises')
await fsp.writeFile("data.csv", csv);

Done!
