Web Scraping with Deno
Maybe you've had this problem before: you need data from a website, but there isn't a good way to download a CSV or hit an API to get it. In these situations, you may be forced to write a web scraper. A web scraper is a script that downloads the HTML from a website as though it were a normal user, and then it parses that HTML to extract information from it. JavaScript is a natural choice for writing your web scraper, as it's the language of the web and natively understands the DOM. And with TypeScript, it is even better, as the type system helps you navigate the trickiness of HTMLCollections and Element types.
You can write your TypeScript scraper as a Node script, but this has some limitations. TypeScript support isn't native to Node. Node also doesn't support web APIs very well, having only recently implemented fetch, for example. Making your scraper in Node is, to be frank, a bit of a pain. But there's an alternative way to write your scraper in TypeScript: Deno!
Deno is a newer JavaScript/TypeScript runtime, co-created by one of Node's creators, Ryan Dahl. Deno has native support for TypeScript, and it supports web APIs out of the box. With Deno, we can write a web scraper in TypeScript with far fewer dependencies and much less boilerplate than you'd need to write the same thing in Node.
In this blog, we’re going to build a web scraper using Deno. Along the way, we’ll also explore some of the key advantages of Deno, as well as some differences to be aware of if you’re coming from writing code exclusively for Node.
Getting Started
First, you'll need to install Deno. There are several ways to install it locally, depending on your operating system.
After installation, we'll set up our project. We'll keep things simple and start with an empty directory.
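Something like this works (the directory name is just a placeholder):

```sh
mkdir deno-web-scraper
cd deno-web-scraper
```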
Now, let's add some configuration. I like to keep my code consistent with linters (e.g. ESLint) and formatters (e.g. Prettier). But Deno doesn't need ESLint or Prettier. It handles linting and formatting itself. You can set up your preferences for how to lint and format your code in a deno.json file. Check out this example:
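Here's one possible deno.json; the lint tags and formatter options shown are just my preferences, not requirements. (Note that older Deno releases expect the formatter settings nested under fmt.options.)

```json
{
  "lint": {
    "rules": {
      "tags": ["recommended"]
    }
  },
  "fmt": {
    "useTabs": false,
    "lineWidth": 80,
    "indentWidth": 2,
    "singleQuote": true
  }
}
```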
You can then lint and format your Deno project with deno lint and deno fmt, respectively.
If you're like me, you like your editor to handle linting and formatting on save. So once Deno is installed, and your linting/formatting options are configured, you'll want to set up your development environment too.
I use Visual Studio Code with the Deno extension. Once installed, I let VS Code know that I'm in a Deno project and that I'd like to auto-format by adding a .vscode/settings.json file with the following contents:
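A minimal version looks something like this:

```json
{
  "deno.enable": true,
  "deno.lint": true,
  "editor.formatOnSave": true,
  "editor.defaultFormatter": "denoland.vscode-deno"
}
```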
Writing a Web Scraper
With Deno installed and a project directory ready to go, let's write our web scraper! And what shall we scrape? Let's scrape the list of Deno 3rd party modules from the official documentation site. Our script will pull the top 60 Deno modules listed, and output them to a JSON file.
For this example, we'll only need one script. I'll name it index.ts for simplicity's sake. Inside our script, we'll create three functions: sleep, getModules and getModule.
The first will give us an easy way to wait between fetches. This protects both our script and the owners of the website we're contacting: flooding a site with many requests in quick succession is unkind, can look like a Denial of Service (DoS) attack, and could get your IP address banned from contacting the site at all. The sleep function is fully implemented in the code block below. The second function (getModules) will scrape the first three pages of the Deno 3rd party modules list, and the third (getModule) will scrape the details for each individual module.
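```typescript
// Resolves after the given number of milliseconds, so callers can
// `await sleep(1000)` between requests.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```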
Scraping the Modules List
First, let's write our getModules function. This function will scrape the first three pages of the Deno 3rd party modules list. Deno supports the fetch API, so we can use that to make our requests. We'll also use the deno_dom module to parse the HTML response into a DOM tree that we can traverse.
Two things we'll want to do upfront: let's import the deno_dom parsing module, and create a type for the data we're trying to scrape.
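Here's a sketch. The deno_dom version pin is just an example, and the fields on Module are illustrative placeholders; collect whatever data you care about:

```typescript
// deno_dom gives us a browser-like DOMParser inside Deno.
import {
  DOMParser,
  type Element,
} from 'https://deno.land/x/deno_dom@v0.1.38/deno-dom-wasm.ts';

// The shape of the data we want to collect for each module.
// These fields are illustrative placeholders.
interface Module {
  name: string;
  description?: string;
  starCount?: string;
}
```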
Next, we'll set up our initial fetch and data parsing:
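Here's a sketch of the skeleton, assuming the listing lives at https://deno.land/x and paginates via a page query parameter, 20 modules per page (so three pages covers our top 60):

```typescript
async function getModules(): Promise<void> {
  const modules: Module[] = [];

  for (let page = 1; page <= 3; page++) {
    // Fetch one page of the third-party module listing.
    const response = await fetch(`https://deno.land/x?page=${page}`);
    const pageText = await response.text();

    // Parse the raw HTML into a document we can traverse.
    const document = new DOMParser().parseFromString(pageText, 'text/html');
    if (!document) continue;

    // ...extract the module links here (shown in the next snippet)...

    // Wait between page fetches so we don't flood the site.
    await sleep(1000);
  }

  // ...write the results to disk (shown later)...
}
```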
Now that we're able to grab the contents of the first three pages of Deno modules, we need to parse each document for the list of modules we want to collect information on. Then, once we have each module's URL, we'll scrape its individual page with getModule.
Passing the text of the page to the DOMParser from deno_dom, you can extract information using the same APIs you'd use in the browser, like querySelector and getElementsByTagName. To figure out what to select, use your browser's developer tools to inspect the page and find selectors for the elements you're interested in. For example, in Chrome DevTools, you can right-click on an element and choose "Copy > Copy selector" to get a selector for that element.
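Here's a sketch of that extraction step, which slots into the page loop above. The a.link selector is a placeholder; substitute whatever selector DevTools gives you for the real module links:

```typescript
// Inside the page loop of getModules, after parsing `document`:
const links = document.querySelectorAll('a.link'); // placeholder selector

for (const node of links) {
  // querySelectorAll yields Nodes; cast to Element to read attributes.
  const path = (node as Element).getAttribute('href');
  if (!path?.startsWith('/x/')) continue; // keep only module links

  // Be polite: pause before each module-page request.
  await sleep(1000);
  const module = await getModule(path);
  if (module) modules.push(module);
}
```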
Scraping a Single Module
Next we'll write getModule. This function will scrape a single Deno module page, and give us information about it.
If you're following along so far, you might've noticed that the paths we got from the directory of modules look like this:
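```
/x/module_name
```

(where module_name is a placeholder standing in for a real module's name)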
But if you navigate to that page in the browser, the URL looks like this:
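```
https://deno.land/x/module_name@v1.0.0
```

(again with a placeholder name and version tag)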
Deno uses redirects to send you to the latest version of a module. We'll need to follow those redirects to get the correct URL for the module page. We can do that with the redirect: 'follow' option in the fetch call. We'll also need to set the Accept header to text/html, or else we'll get a 404.
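Here's a sketch of that fetch:

```typescript
async function getModule(path: string): Promise<Module | undefined> {
  const response = await fetch(`https://deno.land${path}`, {
    // Follow deno.land's redirect to the latest tagged version.
    redirect: 'follow',
    // Without an HTML Accept header, the site responds with a 404.
    headers: { 'Accept': 'text/html' },
  });
  const pageText = await response.text();

  // ...parse the module details out of pageText (shown next)...
}
```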
Now we'll parse the module data, just like we did with the directory of modules.
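Here's a sketch of the parsing that finishes getModule. Every selector below is a placeholder; inspect the real page with DevTools and substitute your own:

```typescript
// Continuing getModule, after fetching pageText:
const document = new DOMParser().parseFromString(pageText, 'text/html');
if (!document) return undefined;

// Placeholder selectors; find the real ones in your browser's DevTools.
const name = document.querySelector('h1')?.textContent?.trim();
if (!name) return undefined;

const description = document.querySelector('.description')?.textContent
  ?.trim();
const starCount = document.querySelector('.star-count')?.textContent?.trim();

return { name, description, starCount };
```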
Writing Our Data to a File
Finally, we'll write our data to a file. We'll use the Deno.writeTextFile API to write our data to a file called output.json.
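At the end of getModules, once every module has been collected, it's a one-liner:

```typescript
// Pretty-print the collected modules into output.json.
await Deno.writeTextFile('./output.json', JSON.stringify(modules, null, 2));
```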
Lastly, we just need to invoke our getModules function to start the process.
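Deno supports top-level await, so this can be a single line at the bottom of index.ts:

```typescript
await getModules();
```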
Running Our Script
Deno has security features built in that prevent it from doing things like accessing the file system or the network without explicit permission. We grant these permissions by passing the --allow-net and --allow-write flags when we run the script:
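```sh
deno run --allow-net --allow-write index.ts
```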
After we let our script run (which, if you've wisely set a small delay with each request, will take some time), we'll have a new output.json file with data like this:
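The values here are placeholders rather than real scraped results, but the shape follows our Module type:

```json
[
  {
    "name": "module_name",
    "description": "A short description of the module.",
    "starCount": "100"
  }
]
```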
Putting It All Together
Tada! A quick and easy way to get data from websites using Deno and our favorite JavaScript browser APIs. You can view the entire script in one piece on GitHub.
In this blog, you've gotten a basic intro to how to use Deno in lieu of Node for simple scripts, and have seen a few of the key differences. If you want to dive deeper into Deno, start with the official documentation. And when you're ready for more, check out the resources available from This Dot Labs at deno.framework.dev, and our Deno backend starter app at starter.dev!