PhantomJS module?
#1
I've recently been using PhantomJS inside an addon to scrape content from a JavaScript-heavy site. It occurred to me that it might be useful to extract this into a module that other plugin developers can use.

Rationale
Many content sites today rely heavily on JavaScript and can be hard, if not impossible, to scrape using simple HTTP libraries. For some sites you really need a full browser with a DOM implementation and JavaScript support. PhantomJS is a headless, fully scriptable WebKit browser that is well suited to this task.

Features
The module would give plugin developers an easy way to scrape sites using PhantomJS. It would provide:
  • A Python API to invoke phantomjs with a given script, passing arguments to it and returning the results as a dictionary.
  • A JavaScript PhantomJS script with some useful utilities for writing scraper scripts.
  • Possibly a "background" mode which keeps the phantomjs process running so it can be reused on subsequent plugin calls.
  • Possibly automatic installation of phantomjs for the user.
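
To make the first item concrete, here is a minimal sketch of what such a Python wrapper might look like. The function name `run_phantom_script` and its parameters are hypothetical, and it assumes the scraper script prints a single JSON object to stdout before exiting. That convention is my assumption, not something PhantomJS enforces. It relies only on the standard `phantomjs script.js arg1 arg2 ...` command-line form and is written for Python 3.

```python
import json
import subprocess


def run_phantom_script(script_path, args=(), phantomjs_bin="phantomjs", timeout=60):
    """Run a PhantomJS script and return its output as a dictionary.

    Assumption: the script writes exactly one JSON object to stdout.
    """
    # PhantomJS is invoked as: phantomjs <script> [arg ...]
    cmd = [phantomjs_bin, script_path] + [str(a) for a in args]
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    if proc.returncode != 0:
        raise RuntimeError(
            "phantomjs exited with code %d: %s"
            % (proc.returncode, proc.stderr.decode("utf-8", "replace"))
        )
    # Parse the script's stdout as JSON and hand it back to the caller.
    return json.loads(proc.stdout.decode("utf-8"))
```

A plugin could then call something like `run_phantom_script("scrape.js", ["http://example.com"])`, where `scrape.js` is a placeholder script name. The "background" mode would need a different design (a long-lived process and some form of IPC), which this sketch does not attempt.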


Just wondering if there's any interest in such a module, or if there is a better way to approach the problem altogether?
Reply
#2
Can it run on Android?
Can it run on the RPi and still play an HD video?

I haven't found a site yet that couldn't be scraped in Python. Can you give a good scenario?
Reply
#3
Platform support is indeed a problem, especially for iOS/Android. There are binaries available for Raspberry Pi, but I don't know what performance impact it might have. So yes, PhantomJS is a last-resort solution.

My specific use case is interacting with the Rdio website (http://rdio.com) (yes, they do have an API, but for various reasons there are still things that require the site to be accessed directly). It's a single-page webapp written with Backbone.js. If you can figure out how to log in using Python only, let me know. Hitting the sign-in button triggers some JavaScript which inserts an authorization key into the sign-in XHR. Without it the request fails, but the key is not available in the page's HTML.

Edit: Here's another forum post with a similar problem: http://forum.xbmc.org/showthread.php?tid=141371
Reply
#4
Heimdall (or extra modules) would perhaps be interested in this, but it would most likely be specific to Windows, OS X and Linux. PhantomJS depends on Qt.
Reply
#5
I agree that some way of interpreting JavaScript automatically would be nice. I'm just trying to get a list of problems going so it's more obvious what needs to be solved.

I was helping Eldorado work through that one and he eventually cracked it (JavaScript and Python are pretty similar from a readability standpoint).

I think the most promising option we found was PyV8: https://code.google.com/p/pyv8/
Reply
#6
Posting this thread prompted me to think harder about alternatives, and I've managed to find a Python-only solution to my problem. It's a bit messy in that it requires extracting some data from the JavaScript using a regex, but it does work. It's not too hard to envisage a situation in which a full JS and DOM engine would be necessary to figure things out, but the module is perhaps best left until it's clearer what's needed (if anything).
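
For anyone curious, the regex approach looks roughly like the sketch below. This is a generic illustration, not the actual code or the actual variable name from the site: `extract_js_value` and `authorizationKey` are made-up names, and it only handles a simple quoted string assigned with `=` or `:`.

```python
import re


def extract_js_value(js_source, var_name):
    """Return the quoted string assigned to var_name in js_source, or None.

    Handles forms like  var name = "value";  and  {name: 'value'}  and
    {"name": "value"}. Anything more complex needs a real JS engine.
    """
    pattern = r"""\b%s['"]?\s*[:=]\s*(['"])(.*?)\1""" % re.escape(var_name)
    match = re.search(pattern, js_source)
    # Group 1 is the opening quote, group 2 is the value between the quotes.
    return match.group(2) if match else None


# Hypothetical snippet of page JavaScript, for illustration only.
sample = 'var Env = {authorizationKey: "abc123", version: "1.0"};'
key = extract_js_value(sample, "authorizationKey")
```

It breaks as soon as the site changes how the value is embedded or starts computing it at runtime, which is where something like PhantomJS or PyV8 would come in.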
Reply
#7
I have quite a complicated example: this one uses server-side salts and MD5 to encode everything.
Just do a GET in incognito mode.

http://vod.walla.co.il/
Reply
