Webscraping on hard mode with Purescript

Published 2022-08-30

I pay for a membership at a site called woodworkingmasterclasses.com. The nice part of this site: it's filled with videos of Paul Sellers making stuff out of wood. As a guy who's reached his 30s, the people-making-things-out-of-wood genre has become suddenly important to me.

The bad part of the site: it doesn't allow downloads. And I want to download, because I have a long flight with no internet coming up. So, let's fix this problem. Let's do some web scraping with Purescript!

How hard could it be?

The initial plan:

The projects on the site are split across different episodes, each episode has its own page, and each page has a video hosted by Vimeo.

Thus, we need to figure out authentication, listing out the episodes, scraping html, and downloading things from vimeo.

I have no idea how to do any of those steps with Purescript yet, but if Vimeo is anything like Youtube, I assume the hardest part will be figuring out how to piece together all the video fragments. So, that's where I start.

Reconnaissance: figuring out how to download videos from Vimeo

Actually, it turns out this is surprisingly easy.

Load any random public vimeo page and keep an eye on the network tab. Amongst all the noise, you'll find a call to a /config endpoint. This conveniently lists out whatever .mp4 files are available for direct download.

{
	"request": {
		"files": {
			"progressive": [{
				"mime": "video/mp4",
				"url": "https://vod-progressive.akamaized.net/...",
				"cdn": "akamai_interconnect",
				...
			}]
		}
	},
	...
}

For public content, it generally seems to only have a single low-res download option, presumably as some kind of fallback. However, super conveniently, for the privately hosted videos that I'm after, the 1080p files are just sitting there ready for the taking. So, the only real technical thing to solve is how to get the config metadata for those private videos. While the endpoint works fine for public stuff, trying to hit it for the videos that are embedded on the site gives a 403:

{"message": "Because of its privacy settings, this video cannot be played here.", "title": "Sorry", "view": 7}

Some quick perusing around the Vimeo docs, and it looks like a safe bet that the videos are configured with domain level privacy, which means that we just need to spoof the Referer header in order to get it to let us retrieve the private metadata.

Sure enough, as long as I say I'm from woodworkingmasterclasses.com, I can freely grab the associated metadata.
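Once the config JSON is in hand, picking the best file is a small job. Here's a minimal Python sketch, assuming each progressive entry carries a height field (the helper name, sample data, and URLs are made up for illustration):

```python
def best_progressive(config):
    """Return the tallest entry under request.files.progressive, or None."""
    files = config.get("request", {}).get("files", {}).get("progressive", [])
    # Entries without a height sort as 0, so they lose to anything real.
    return max(files, key=lambda f: f.get("height", 0), default=None)


# Hypothetical sample shaped like the /config response shown earlier.
sample = {
    "request": {
        "files": {
            "progressive": [
                {"mime": "video/mp4", "height": 360, "url": "https://example.invalid/360.mp4"},
                {"mime": "video/mp4", "height": 1080, "url": "https://example.invalid/1080.mp4"},
            ]
        }
    }
}

best = best_progressive(sample)
print(best["height"], best["url"])
```

Returning None for an empty list keeps the "no direct download available" case explicit instead of raising.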

As an interesting aside, it looks like Vimeo's identifiers are all just 9 character numeric strings. If that's right, that's a pretty small key space (like, really small, right?): at most a billion IDs. If all you need to know in order to download stuff that's guarded by "domain private" is (a) the domain, and (b) the identifier of the video, it seems like you could skip all the scraping (or paying!) for content by just exhaustively trying every possible URL. If you could get 100 requests/sec out of a Lambda, and you can spin up 2k of them at a time in a single account (and ignoring things like rate limiting), you could chew through a billion IDs in about an hour and a half. It'd probably all be within the free tier.
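The back-of-the-envelope math, as a quick Python sketch. The per-Lambda rate and concurrency are the hypothetical figures from the paragraph above, not measurements:

```python
ID_DIGITS = 9                      # assumed length of a Vimeo video id
REQUESTS_PER_SEC_PER_LAMBDA = 100  # hypothetical throughput per Lambda
CONCURRENT_LAMBDAS = 2_000         # hypothetical concurrency in one account

key_space = 10 ** ID_DIGITS
total_rate = REQUESTS_PER_SEC_PER_LAMBDA * CONCURRENT_LAMBDAS
seconds = key_space / total_rate
minutes = seconds / 60

print(f"{key_space:,} ids at {total_rate:,} req/s: {seconds:,.0f} s (~{minutes:.0f} min)")
```

That works out to 5,000 seconds, or a little over 83 minutes, before rate limiting enters the picture.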

I kicked around this idea for a bit, but decided rate limiting would be obnoxious to work around, and I'm just here to explore the Purescript ecosystem a bit more, not attack Vimeo's APIs with my own ramshackle army of Lambdas. So I continue on with the scraping approach.

Making HTTP requests in Purescript

Affjax seems like the popular candidate. In addition to being the top hit for "purescript http requests," it's the library covered in Purescript by Example, and also one of the ones suggested in Jordan's Purescript Reference. Not knowing anything about the ecosystem, I go with it.

First steps are easy enough. Requests against my own domain work as expected.

main :: Effect Unit 
main = launchAff_ do 
  result <- AN.get ResponseFormat.string "https://chriskiehl.com"
  case result of 
    Right response -> log "Hooray!"
    Left err -> log $ AN.printError err

Next I tried grabbing one of the Vimeo /config URLs

main :: Effect Unit 
main = launchAff_ do 
  result <- AN.request AN.defaultRequest {
      url="https://player.vimeo.com/video/668257777/config?h=&app_id="
    , headers=[(RequestHeader "Referer" "https://woodworkingmasterclasses.com/")]
    , responseFormat=ResponseFormat.string 
  }
  case result of 
    Right response -> log response.body 
    Left err -> log $ AN.printError err

And succes– wait, I'm getting that privacy error.

Because of its privacy settings, this video cannot be played here

But I am sending the Referer header. Or, at least, the code says I should be sending it.

Purescript Debugging part I

I don't know what the state of the art is for debugging Purescript, but my first go-to is Purescript Debug, which is a glorious little library for completely ignoring the type system in order to do dangerous, unsafe side-effects like logging (the horror!), but all without having to worry about Show instances, or changing type signatures.

main :: Effect Unit 
main = launchAff_ do 
  result <- AN.request (spy "request" AN.defaultRequest {
      url="https://player.vimeo.com/video/668257777/config?h=&app_id="
    , headers=[(RequestHeader "Referer" "https://woodworkingmasterclasses.com/")]
    , responseFormat=ResponseFormat.string 
  })

I spend some time poking and prodding around with spy to dump out logs, but everything looks like it's correct. The headers are definitely being set on the Purescript side.

Request: {
  url: 'https://player.vimeo.com/video/000000000/config?h=&app_id=',
  headers: [
    RequestHeader {
      value0: 'Referer',
      value1: 'https://woodworkingmasterclasses.com/'
    },
    RequestHeader {
      value0: 'User-Agent',
      value1: 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0'
    }
  ],
  ...
}

At this point, I'm not sure if it's something in the library and headers aren't being sent, or if it's user error, and I'm messing something up. So, I do a quick sanity check by sending the same request and headers with Python

import requests

url = "https://player.vimeo.com/video/00000000/config?h=&app_id="
resp = requests.get(url, headers={'Referer': 'https://woodworkingmasterclasses.com/'})
print(resp.status_code, resp.reason)

and it works just fine.

200 OK

So, it's something about how the Purescript library is handling, or not handling, or completely ignoring my headers, but superficial logging can't answer why. I need to see what's being put on the wire.

Debugging Part II - looking at request header via Python

Python having a simple http.server just hanging out in its standard lib is one of the things that makes it such an endearing language. I spin up a minimal server to echo out whatever headers it receives.

import http.server
import socketserver


class HeyListen(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        # Echo whatever headers the client actually put on the wire.
        print(self.headers)
        return super().do_GET()


with socketserver.TCPServer(("", 8080), HeyListen) as httpd:
    httpd.serve_forever()

Ping it from the Purescript side.

main :: Effect Unit 
main = launchAff_ do 
  result <- AN.request AN.defaultRequest {
      url="http://127.0.0.1:8080/"  -- pointing at the dev server
    , headers=[
          (RequestHeader "Referer" "https://woodworkingmasterclasses.com/")
        , (RequestHeader "User-Agent" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0")
      ]
    , responseFormat=ResponseFormat.string 
  }

And sure enough, it's missing the Referer header. >: (

User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
Connection: keep-alive
Host: 127.0.0.1:8080
Content-Length: 0

Debugging Part III

I need to be able to debug into the libraries Purescript is using to see what's actually going on. I don't know how to attach a debugger to Purescript itself, but doing so in Node is super easy, so debugging the transpiled source it is!

With a mix of the Spago docs, and this old forum post, I'm able to get things bundled and into a shape where I can attach a debugger.

yarn spago bundle-module --main Main --to index.js --platform node
yarn esbuild ./output/Main/index.js --bundle --platform=node > index.js

After a few minutes of stepping through the execution, which is pretty wacky given the currying that goes on in the transpiled source, I eventually find where requests are processed, and see the offending line.

setRequestHeader(name4, value3) {
	...
	// What?!  
	if (this._restrictedHeaders[loweredName] || /^sec\-/.test(loweredName) || /^proxy-/.test(loweredName)) {
	  console.warn(`Refused to set unsafe header "${name4}"`);
	  return void 0;
	}
	value3 = value3.toString();
	...
	return void 0;
  }

It's one of those times where things not working is literally a feature, not a bug.

XHR2 Patching Part I

It turns out Affjax's Node driver is built on top of xhr2, which is an old CoffeeScript library that emulates the browser's XMLHttpRequest object in Node. The problem is how well it emulates it. XHR is a browser-specific thing, and browsers ban scripts from setting certain headers, Referer among them. xhr2 dutifully mirrors this behavior and prevents the client from controlling restricted headers.

XMLHttpRequest.prototype._restrictedHeaders = {
	...
	host: true,
	'keep-alive': true,
	origin: true,
	referer: true,   <------   >:(
	te: true,
	trailer: true,
	...

I should have taken this as a sign of pain to come, but this was "just a one line fix", so I patch the source to un-ban referer. The business of scraping resumes.
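Boiled down, the check that was eating my header looks something like this. It's a Python paraphrase of the logic in the snippets above, not xhr2's actual code; set_request_header is a made-up name, and RESTRICTED holds only the subset of names visible in the excerpt:

```python
# Subset of the restricted names shown in the xhr2 excerpt above.
RESTRICTED = {"host", "keep-alive", "origin", "referer", "te", "trailer"}


def set_request_header(headers, name, value, restricted=RESTRICTED):
    """Store the header unless it is restricted; return True if it was stored."""
    lowered = name.lower()
    # Mirrors the two regex tests: names starting with "sec-" or "proxy-".
    if lowered in restricted or lowered.startswith(("sec-", "proxy-")):
        return False  # silently dropped, apart from a console warning in xhr2
    headers[name] = str(value)
    return True


headers = {}
set_request_header(headers, "Referer", "https://woodworkingmasterclasses.com/")
set_request_header(headers, "User-Agent", "Mozilla/5.0")
print(headers)  # only User-Agent survives

# The "one line fix": drop referer from the restricted set.
patched = RESTRICTED - {"referer"}
set_request_header(headers, "Referer", "https://woodworkingmasterclasses.com/", patched)
print(sorted(headers))
```

The important detail is that the refusal is not an error: the caller gets no exception, which is why everything looked fine from the Purescript side.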

Parsing json in Purescript

Marshalling json into records is done via Argonaut. The github page has a nice README with two decent examples. From what I can tell about Purescript culture, these two examples probably push the library into being considered over documented and in poor taste – bordering on insulting the reader's intelligence. Read the source or GTFO.

Still, getting the happy path working is basically a one-liner, which seems like dark magic.

type VimeoConfig = {
  video :: {title :: String},
  request :: {
    files :: {
      progressive :: NonEmptyArray {
        mime :: String,
        url :: String,
        id :: String, 
        height :: Int,
        width :: Int 
      }
    }
  }
}


configFromJson :: Json -> Either JsonDecodeError VimeoConfig
configFromJson = decodeJson

A brief pause for refactoring

Parsing json brings parser errors, and that brings the joys of getting types to line up. Unlike the familiar and cozy Exception hierarchy that you'd find in languages like Java or Python, where errors are always rooted in some bedrock base type, a JsonDecodeError from Argonaut has absolutely nothing in common with the Error being returned from Affjax, and neither of those has anything in common with the Error which comes from Aff or Effect. Meaning, it makes composing the pieces together super difficult and tedious.

There's probably something that solves all of this elegantly, but as an impatient Purescript amateur, I just break out the hammer and start writing plumbing code to standardize all the error types under a single Error type so things line up again.

-- Swapped from Affjax Error to Aff.Error
sendRequest :: forall a. AN.Request a -> Aff (Either Error (AN.Response a))
sendRequest request = do
  result <- AN.request request
  pure $ case result of
    Right success -> Right success
    Left err -> Left $ error (AN.printError (spy "ERROR" err))


-- Ditto, but JsonDecodeError to Aff.Error 
vimeoConfigFromJson :: Json -> Either Error VimeoConfig 
vimeoConfigFromJson json = case (decodeJson json) of 
  Right result -> Right result 
  Left err -> Left $ error (show err)

That now lets me actually feed the result of one call into another, which is progress!


main :: Effect Unit 
main = launchAff_ do 
  let request = AN.defaultRequest {
      url="https://player.vimeo.com/video/379726407/config"
    , headers=[
          (RequestHeader "Referer" "https://woodworkingmasterclasses.com/")
        , (RequestHeader "User-Agent" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0")
      ]
    , responseFormat=ResponseFormat.json
  }
  result <- sendRequest request
  vimeoMetadata <- pure $ result <#> _.body >>= vimeoConfigFromJson
  case vimeoMetadata of 
    Right x -> log (show x)
    Left err -> log (show err)

The compiler is happy. I am happy.

Until I try making another request.

Patching FormData

Next is trying to get the episode list from the site. This is a public endpoint serving json, so it should be pretty straightforward. The only quirk is that it takes form-data as a payload. No big deal, Purescript has a library for building form-data.

spago install web-xhr

FormData being an Effect should have set off some warning flags, but I ignore them and get things stitched together.

listEpisodes :: Aff (Either Error (Array Episode))
listEpisodes = do 
  payload <- liftEffect $ buildPayload
  result <- sendRequest (buildRequest payload) 
  pure $ result <#> _.body >>= episodesFromJson
  where 
  buildRequest :: FD.FormData -> AN.Request Json
  buildRequest payload = AN.defaultRequest {
      url="https://woodworkingmasterclasses.com/wp-admin/admin-ajax.php"
    , responseFormat=ResponseFormat.json
    , method = Left POST
    , content = Just $ FormData payload 
    } 
  
  buildPayload :: Effect FD.FormData
  buildPayload = do 
    form <- FD.new 
    FD.set (FD.EntryName "action") "wwmc_search_videos" form 
    FD.set (FD.EntryName "searchData[sorting]") "date desc" form 
    FD.set (FD.EntryName "searchData[Episodes]") "paid" form 
    FD.set (FD.EntryName "searchData[showIntros]") "show" form 
    FD.set (FD.EntryName "searchData[showEpisodes]") "show" form 
    FD.set (FD.EntryName "searchData[showStandalones]") "show" form 
    FD.set (FD.EntryName "searchData[watchedVideos]") "show" form 
    FD.set (FD.EntryName "searchData[keywords]") "" form 
    pure form

Everything compiles, so I give it a try.

main :: Effect Unit 
main = launchAff_ do 
  result <- listEpisodes  
  case result of 
    Right x -> log (show x)
    Left err -> log (show err)

And.... another error.

file:///C:/Users/.../Effect.Aff/foreign.js:530
                throw util.fromLeft(step);
                ^
TypeError: fd.set is not a function
    at Module._set (file:///C:/Users/.../Web.XHR.FormData/foreign.js:29:6)

So, it turns out that purescript-web-xhr assumes it's going to be run in the browser, and will explode when run under Node since the FormData API doesn't exist. However, this is a pretty simple fix since, like xhr, there's a Node port available on NPM.

Once installed, it's a quick patch to Purescript-Web-XHR's FFI definitions and we're back working again.

// importing our shim
import FormData from 'form-data';

const newImpl = function () {
  const fd = new FormData();  // <-- no longer explodes!
  fd.set = fd.append;  // <-- minor patching to satisfy the expected API
  return fd;
};

Until we hit the next error.

Unsupported send() data ${data}

More debugging and it's xhr2 again.

XHR patching part II

xhr2 is picky with the payload types. If it's not a string or a buffer, it'll just throw its hands up and explode. The problem is that the FormData object is being passed from the Purescript side, so when it finds that type in the payload, it bombs out.

_setData(data) {
  var body, i, j, k, offset, ref, ref1, view;
  if (typeof data === 'undefined' || data === null) {
	return;
  }
  if (typeof data === 'string') {
	...
  } else if (Buffer.isBuffer(data)) {
	...
  } else {
	throw new Error(`Unsupported send() data ${data}`);
  }

I assume the idea on the xhr side was that Form Data would be turned into a buffer before being handed over to it. However, that's not how things shook out, so I again just patch to add quick support for the datatype.

_setData(data) {
  var body, i, j, k, offset, ref, ref1, view;
  if (typeof data === 'undefined' || data === null) {
	return;
  }
  if (typeof data === 'string') {
	...
  } else if (Buffer.isBuffer(data)) {
	...
  } else if (typeof data == 'object') { // NEW! 
	  this._contentType = data.getHeaders()['content-type']; 
	  this._body = data.getBuffer();
  } else {
	throw new Error(`Unsupported send() data ${data}`);
  }
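For a sense of what "turned into a buffer" means for form data, here's a rough Python sketch of multipart/form-data serialisation for plain text fields. It's an illustration of the wire format, not what the form-data package does internally; encode_multipart and the fixed boundary are hypothetical, and it skips file parts and escaping entirely:

```python
def encode_multipart(fields, boundary):
    """Serialise (name, value) text fields; return (content_type, body_bytes)."""
    lines = []
    for name, value in fields:
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")  # blank line separates part headers from the value
        lines.append(value)
    lines.append(f"--{boundary}--")  # closing delimiter
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return f"multipart/form-data; boundary={boundary}", body


content_type, body = encode_multipart(
    [("action", "wwmc_search_videos"), ("searchData[sorting]", "date desc")],
    boundary="hypothetical-boundary",
)
print(content_type)
print(body.decode("utf-8"))
```

The pair it returns lines up with the two things the patch pulls off the FormData object: a content-type header carrying the boundary, and the bytes of the body.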

Patch is completed. Things appear to be working. Back to the task of scraping – oh, wait... another error.

Login attempts and XHR2 patching part III

Login is failing. More stepping through with a debugger. More cookie woes. set-cookie is considered a "private" header, and thus gets stripped from the response.

Don't tread on me, xhr2 you bastard. I remove all the restrictions I see.

XMLHttpRequest.prototype._restrictedHeaders = {};
...
XMLHttpRequest.prototype._privateHeaders = {};

And attempt my login again.

login :: Aff (Either Error AuthCookie)
login = do 
  payload <- liftEffect buildPayload
  result <- sendRequest (buildRequest payload)
  pure $ result <#> _.headers >>= grabCookie 
  where 
  buildPayload :: Effect FD.FormData
  buildPayload = do 
    form <- FD.new 
    FD.set (FD.EntryName "log") "USERNAME" form
    FD.set (FD.EntryName "pwd") "PASSWORD" form
    FD.set (FD.EntryName "wp-submit") "Log+In" form
    FD.set (FD.EntryName "redirect_to") "https://woodworkingmasterclasses.com/dashboard/" form
    FD.set (FD.EntryName "testcookie") "1" form
    pure form 

  buildRequest :: FD.FormData -> AN.Request String
  buildRequest payload = AN.defaultRequest {
      url = "https://woodworkingmasterclasses.com/wp-login.php" 
    , method = Left POST
    , headers=[
        (RequestHeader "User-Agent" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0")
      , (RequestHeader "Referer" "https://woodworkingmasterclasses.com/wp-login.php")
      , (RequestHeader "Cookie" "COOKIE HERE")
      , (RequestHeader "Origin" "https://woodworkingmasterclasses.com")
    ]
    , responseFormat=ResponseFormat.string
    , content = Just $ FormData payload } 

  grabCookie :: Array ResponseHeader -> Either Error AuthCookie 
  grabCookie headers = 
    let cookie = find (\header -> RH.name header == "set-cookie") headers <#> RH.value
    in note (error $ "Unable to find cookie in login response\n" <> (show headers)) cookie

but still no dice. While the initial POST succeeds, subsequent redirects fail because the auth cookie isn't being propagated.

Login attempts and XHR2 patching part IV

At this point, it's a lot of manual tinkering to figure out what's wrong with the cookies, how they need to be formatted, and how to move them around in xhr.

I also anger Cloudflare at this point and start getting throttled even though I'm going at human speeds. So, my brute force plan from above surely would have encountered similar troubles.

I ultimately give up on getting xhr2 to manage cookies between requests. I remind myself that the goal here is to download videos before a plane ride, not extend an XHR emulator's functionality. I settle for ripping out the redirect logic, and just returning the initial set-cookie values, reformatted into the shape the request Cookie header expects.

if (loweredName === "set-cookie") {
	value = response.headers['set-cookie']
	  .map(x => x.split(";")[0])
	  .filter(x => x.startsWith("ael") || x.startsWith("w") || x.startsWith("pum"))
	  .join("; ")
  }

With this, finally, login works and returns the appropriate cookies.
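The same transformation, sketched in Python for clarity. The prefixes come from the patch above; the function name and the sample cookie values are made up:

```python
KEEP_PREFIXES = ("ael", "w", "pum")  # prefixes taken from the patch above


def to_cookie_header(set_cookie_values, keep=KEEP_PREFIXES):
    """Turn a list of Set-Cookie values into a single Cookie header value."""
    # Everything after the first ";" is attributes (path, expires, ...).
    pairs = (value.split(";")[0] for value in set_cookie_values)
    return "; ".join(pair for pair in pairs if pair.startswith(keep))


# Hypothetical Set-Cookie values, not real ones from the site.
sample = [
    "wordpress_logged_in_abc=token123; path=/; HttpOnly",
    "pum-1234=true; expires=Thu, 01 Jan 2026 00:00:00 GMT",
    "tracking=nope; path=/",
]
print(to_cookie_header(sample))
```

Only the name=value part of each cookie is kept, since attributes like path and expires are instructions to the client and don't belong in a request's Cookie header.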

Another brief pause to refactor

While the functionality itself is coming together, the code is a nightmare. There are monads inside of monads inside of monads. Some libraries are even so rude as to use a different monad than the primary one used everywhere else. I'm rocking like 60% pure plumbing code at this point.

So, I take several hours to wrap my head around Monad Transformers, rethink my life, and refactor everything I've ever written. I finally turn on the machine and start downloading those sweet, sweet videos.

main :: Effect Unit 
main = launchEitherAff do 
  cookie <- login  
  episodes <- listEpisodes cookie
  _ <- traverse (downloadEpisode cookie) (map _.permalink episodes)
  pure unit

until...

Things break again and I give up

I make it about 5 videos in before things explode again.

node:internal/validators:93
      throw new ERR_OUT_OF_RANGE(name, `>= ${min} && <= ${max}`, value);
      ^

RangeError [ERR_OUT_OF_RANGE]: The value of "length" is out of range. It must be >= 0 && <= 2147483647. Received 3486181694
    at Object.write (node:fs:817:5)
    at writeAll (node:fs:2068:6)
    at node:fs:2130:7
    at FSReqCallback.oncomplete (node:fs:188:23) {

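Turns out Node has a hard cap on how many bytes it will write from a buffer in one call. The error above puts it at 2147483647 bytes (2^31 - 1, or about 2 GiB), and this video weighed in at 3486181694 bytes, roughly 3.5 GB. Since, in my attempt at a quick and dirty smash and grab, I was just eagerly pulling the entire .mp4 into memory without care, eventually I'd land on a video that exceeded that limit and everything would explode.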

I spend some time poking around the sources a bit to see what it'd take to enable streaming downloads, but we're talking deep cuts into multiple areas and across multiple languages and libraries. I still just want to download a few things before I hop on a plane, so... this is where I gave up.
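For contrast, the streaming shape I was after looks something like this in Python. To be clear, this is a sketch of the general technique, not something I got working in Purescript; stream_copy and the chunk size are arbitrary choices, and an in-memory buffer stands in for the network response:

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice


def stream_copy(source, dest, chunk_size=CHUNK_SIZE):
    """Copy a file-like source to dest in fixed-size chunks; return bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read means the source is exhausted
            break
        dest.write(chunk)
        total += len(chunk)
    return total


# Stand-in for a download: three full chunks plus a short tail.
source = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 17))
dest = io.BytesIO()
print(stream_copy(source, dest))
```

Memory use stays at one chunk no matter how large the file is, so no single write ever gets near the 2 GiB limit. In real use the source would be an HTTP response object and the destination a file opened in binary write mode.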

...

Was it all for nothing? Did I waste hours of my life? Do I feel happier for finally understanding what a monad transformer is? Did I ever get to download those videos? Does this actually just trail off with a series of unanswered questions...?!