Scraping an SPA
I recently needed to scrape some data from https://mapping.ncua.gov/ResearchCreditUnion. I hadn't done any web scraping in years, so I made this guide to document the process.
First Steps
The data I need is for credit unions that meet certain criteria. I can use the site's search filters to yield only the results I want, on the order of 100 out of over 20,000. The search page is an SPA ("search" just changes what's rendered instead of visiting a new page), but each result links to a new page.
I notice some custom HTML attributes along the lines of `_ngcontent-fsp-c114` throughout the page. A web search reveals them to be generated by Angular. So how do you scrape an Angular SPA?
Scraping with Puppeteer
Some research turns up a few tools for scraping a dynamic website. Puppeteer seems to be the most widely used, so I'll pick a nice tutorial and get started. Before running anything, I'll check for a robots.txt.
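A quick way to check is to request the file directly, e.g. from a Node REPL (a sketch assuming Node 18+, where fetch is available globally):

```js
// Fetch the site's robots.txt; a 404 means no crawling rules are published.
const res = await fetch('https://mapping.ncua.gov/robots.txt');
console.log(res.status === 404 ? 'no robots.txt' : await res.text());
```

There isn't one, so here we go!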
First I make a project directory and install puppeteer:
```sh
mkdir scrape-spa
cd scrape-spa
npm i puppeteer --save
```
I try and run some code from the tutorial that visits a page and generates a screenshot. After a small change to generate a larger screenshot, I have:
```js
const puppeteer = require('puppeteer');

const url = process.argv[2];
if (!url) {
  throw new Error("Please provide URL as a first argument");
}

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  // Enlarged viewport so the screenshot captures more of the page.
  await page.setViewport({
    width: 1080,
    height: 2160,
    deviceScaleFactor: 1,
  });
  await page.screenshot({ path: 'screenshot.png' });
  await browser.close();
}

run();
```
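Assuming the code is saved as screenshot.js (my name for the file, not necessarily the tutorial's), it can be run against the search page:

```sh
node screenshot.js https://mapping.ncua.gov/ResearchCreditUnion
```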
It works! Now for interacting with the page. Since I'm working in the dark, I'll try and issue a click to the page and then take a screenshot to display the result. By looking at the page source, I identify an attribute of the HTML element I want to click:
```html
<mat-select _ngcontent-jcn-c114="" role="combobox" aria-autocomplete="none"
  aria-haspopup="true" name="cuStatus" formcontrolname="cuStatus"
  class="mat-select ng-tns-c47-9 ng-tns-c30-8 mat-select-empty ng-untouched ng-pristine ng-valid ng-star-inserted"
  aria-labelledby="mat-form-field-label-11 mat-select-value-3" id="mat-select-2"
  tabindex="0" aria-expanded="false" aria-required="false" aria-disabled="false"
  aria-invalid="false">...</mat-select>
```
I can then programmatically find this element with Puppeteer and issue a click:
```js
const dropdown = await page.$('[name="cuStatus"]');
await dropdown.click();
```
So far so good. Manually clicking this dropdown adds some new elements to the DOM. I've got to click one of these to make a selection:
```js
const [selection] = await page.$x("//span[contains(., 'Active')]");
if (selection) {
  await selection.click();
}
```
It's not clear from the screenshot alone that this worked, but adding a wait interval between the click and the screenshot shows that it did:
```js
await page.waitForTimeout(1000);
```
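A fixed delay like this is a guess; if it proves flaky, a more targeted option is to wait for the new elements themselves. This sketch assumes the options render into Angular Material's usual overlay panel (the .mat-select-panel class is my assumption, not something verified on this page):

```js
// Wait for the dropdown's option panel to become visible instead of sleeping.
await page.waitForSelector('.mat-select-panel', { visible: true });
```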
We can manipulate the other fields similarly until we are ready to perform the search. Then we find the button and click:
```js
const [button] = await page.$x("//button[@title='find more details']/span[contains(., 'FIND')]");
if (button) {
  await button.click();
}
```
Executing the search populates the page with the first 20 search results, and displays buttons for navigating to the rest. Each result contains a button that links to the corresponding page, where the data I need is located. So we need to:
- scrape the hyperlink value off of each result
- click to continue to the next page (if necessary) and repeat
Scraping the hyperlink can be done like this:
```js
let links = [];
const buttons = await page.$x("//a/span[contains(., 'VIEW')]/..");
for (let button of buttons) {
  const attr = await page.evaluate(el => el.getAttribute("href"), button);
  links.push("https://mapping.ncua.gov".concat(attr)); // not sure this is necessary
}
```
Unfortunately this approach results in exactly two copies of each link, because the table of search results exists in the DOM twice:
```html
<div _ngcontent-fsp-c114="" tabindex="0" class="tb-desktop-container">...</div>
<div _ngcontent-fsp-c114="" tabindex="0" class="tb-mobile-container">...</div>
```
I can fix this with the following spaghetti code:
```js
let links = [];
let next = null;
do {
  let newLinks = [];
  const [nextButton] = await page.$x("//button[@aria-label='Next page' and not(@disabled)]");
  next = nextButton;
  const buttons = await page.$x("//a/span[contains(., 'VIEW')]/..");
  for (let b of buttons) {
    const attr = await page.evaluate(el => el.getAttribute("href"), b);
    newLinks.push("https://mapping.ncua.gov".concat(attr));
  }
  // The desktop and mobile tables each contain every link; keep one copy.
  newLinks = newLinks.slice(0, newLinks.length / 2);
  links = links.concat(newLinks);
  if (next) {
    await next.click();
    await page.waitForTimeout(1000);
  }
} while (next);
```
I've conquered the SPA! Now it's time to visit all the scraped links and get the data I need.
It's probably a reasonable assumption that each page renders the same fields, but I'll err on the side of collecting more data and verify that assumption later:
```js
var dict = {};
for (let l of links) {
  await page.goto(l);
  await page.waitForTimeout(2000);
  // Field names live in the cells with class "dvHeader"...
  const fieldElements = await page.$x("//table[@class='table-details']/tbody/tr/td[@class='dvHeader']");
  let fields = [];
  for (let e of fieldElements) {
    const field = await page.evaluate(el => el.textContent, e);
    fields.push(field);
  }
  // ...and values live in the unclassed cells.
  const valueElements = await page.$x("//table[@class='table-details']/tbody/tr/td[not(@class)]");
  let vals = [];
  for (let e of valueElements) {
    const val = await page.evaluate(el => el.textContent, e);
    vals.push(val);
  }
  dict[l] = { Keys: fields, Vals: vals };
}
```
Again, there's probably a lot of duplication here (the field names should be the same for each page). But now I should have everything I need. I'll just write it to a file and then have a closer look in a Node REPL:
```js
var fsp = require('fs/promises');
await fsp.writeFile("data.json", JSON.stringify(dict));
```
Cleaning and Exporting
I need to make sure I've got the same fields for each page, and then I've got to export the data to Excel somehow: CSV seems like a good option.
I'll get a Node REPL open and import the data:
```js
const fs = require('fs');
let rawdata = fs.readFileSync('data.json');
let data = JSON.parse(rawdata);
```
Now I can easily confirm that the keys are the same for each page in the dictionary:
```js
let model = data[Object.keys(data)[0]].Keys;
for (let key of Object.keys(data)) {
  let curr = data[key].Keys;
  if (!(model.length === curr.length &&
        model.every(function (value, index) { return value === curr[index]; }))) {
    console.log("uh oh");
  }
}
```
No output! They're all the same (phew). We can easily construct the CSV now, starting with the header:
```js
let fields = [];
for (let field of data[Object.keys(data)[0]].Keys) {
  fields.push("\"".concat(field.slice(0, -1), "\""));
}
let header = fields.join(",");
```
The slice removes the trailing colon from each field name, and the enclosing quotation marks keep commas in field names from poisoning the CSV. Now for the data:
```js
let rows = [];
for (let key of Object.keys(data)) {
  let vals = [];
  for (let v of data[key].Vals) {
    vals.push("\"".concat(v.trim(), "\""));
  }
  rows.push(vals.join(","));
}
let csv = header.concat("\n", rows.join("\n"));
```
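This naive quoting is good enough for this data, but it would break on a value containing a quotation mark. A more defensive helper (a hypothetical csvQuote, not used above) would double embedded quotes per RFC 4180:

```js
// Escape a value for CSV: double any embedded quotes, then wrap the whole value.
function csvQuote(value) {
  return '"' + String(value).replace(/"/g, '""') + '"';
}
```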
All that's left is to export the file:
```js
var fsp = require('fs/promises');
await fsp.writeFile("data.csv", csv);
```
Done!