Python script as scraper

dbrobins Offline
Member
Posts: 73
Joined: Dec 2009
Reputation: 0
Location: Redmond, WA
Python script as scraper
Post: #1
It doesn't appear possible to use a Python script (instead of an XML file) as a scraper, although it would be handy to be able to do so. Is it actually possible? (I've looked and didn't find any indication that it is.) If it isn't possible, would a patch to allow it be considered? (Apologies if this is the wrong forum. This is not a feature request - I'm quite willing to implement it myself if it isn't there and there's no good reason not to allow it.)
blittan Offline
Team-XBMC Handyman
Posts: 1,739
Joined: Jun 2004
Reputation: 11
Location: Sweden
Post: #2
what would be the advantage of a python scraper? just curious :)

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
If you don't have the time to read them, please don't take the time to post in this forum!
For troubleshooting and bug reporting please make sure you read this first.
dbrobins Offline
Member
Posts: 73
Joined: Dec 2009
Reputation: 0
Location: Redmond, WA
Post: #3
blittan Wrote: what would be the advantage of a python scraper? just curious :)

To some, writing code is more natural than trying to put everything into a series of regexes constrained by that language (even in cases where equivalent power is being used).

There's a good consistency argument too: plugins are Python, while in theory it might have been possible to do them with XML too (although they'd be more limited). Why invent a new language for scrapers? I don't think "ease of use" is a good answer, because a Python template would be as easy or easier to manipulate as the current XML format.

Matching XML (or HTML) with regular expressions seems a little unsafe, although since it seems to be working so far with many sites that's more academic than a real issue.

There are also things the scraper language can't do, of course. As an example - and perhaps this is possible in the scraper language - I noticed that a couple of my movies were matched with TV show episodes of the same name, even though TMDB explicitly calls them out as episodes; I'd prefer matching movies be selected first, falling back to episodes if there are no movies in the list.
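
As a rough illustration of the "movies first, episodes as a fallback" preference, here is a minimal sketch of how a Python scraper might order its matches. The `Match` class and the `kind` labels are hypothetical; they are not part of XBMC or of TMDB's actual response format.

```python
from dataclasses import dataclass


@dataclass
class Match:
    title: str
    kind: str  # hypothetical label: "movie" or "episode"


def rank_matches(matches):
    """Put movies ahead of episodes, keeping the original order within each group."""
    # sorted() is stable, and False sorts before True, so movies come first.
    return sorted(matches, key=lambda m: m.kind != "movie")


results = [
    Match("Example Title", "episode"),
    Match("Example Title", "movie"),
    Match("Example Title II", "episode"),
]
ranked = rank_matches(results)
best = ranked[0]  # the movie if there is one, otherwise the first episode
```

If the list contains no movies at all, the first episode simply stays at the top, which gives the fallback behaviour described above.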

And perhaps this goes against the idea of being able to run scrapers in the background - it might work better as a script - but for some movies I'd like to be able to select a poster other than the first one, or even select among the matched items if they're close.
jmarshall Offline
Team-XBMC Developer
Posts: 25,683
Joined: Oct 2003
Reputation: 169
Post: #4
Regarding the use of python for this, the ideal I think would be for all this stuff to be language-agnostic (as much as that is possible). The problem with this (whether it be python or something else) is how we invoke it and how we get stuff back. We have this with the built-in python interpreter, but passing stuff to a script is still problematic - eg at the moment for plugins we just run the script with suitable arguments whenever we need a directory listing which means you get no persistence.

If you have ideas as to how it can be accomplished, we're listening.

Note that things like identification of tvshows vs movies can be done using the scraper language anyway I should think (if they're marked, then a regexp can likely figure that out and demote them.)

As for selecting alternate posters or among matched items - this is something that requires user interaction anyway - in this case we should choose the "best" guess and it's up to the user to correct it via Movie Information.

Cheers,
Jonathan
dbrobins Offline
Member
Posts: 73
Joined: Dec 2009
Reputation: 0
Location: Redmond, WA
Post: #5
jmarshall Wrote: Regarding the use of python for this, the ideal I think would be for all this stuff to be language-agnostic (as much as that is possible). The problem with this (whether it be python or something else) is how we invoke it and how we get stuff back. We have this with the built-in python interpreter, but passing stuff to a script is still problematic - eg at the moment for plugins we just run the script with suitable arguments whenever we need a directory listing which means you get no persistence.

If you have ideas as to how it can be accomplished, we're listening.

I may build a proof of concept - after I finish this compiler project - and if I get that far I'll discuss it here. I'm certainly not demanding "Build me this feature" - I just wanted to discuss feasibility before I start working on it.

One way to make it language-agnostic would be to communicate over a well-defined protocol (as some plugins already do internally - Medusa?). XBMC could start the program, pass requests on stdin, and read results from stdout (like MythTV's scrapers). That also allows for persistence if startup overhead is significant: XBMC keeps sending requests, then sends "done" and/or simply terminates the program.
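
To make the idea concrete, here is a minimal sketch of the scraper side of such a protocol. Everything about the wire format is an assumption for illustration: one JSON object per line, a `title` field, and a literal `done` line to stop. Neither XBMC nor MythTV is claimed to use this format, and `lookup` is a placeholder for real scraping logic.

```python
import io
import json


def lookup(title):
    # Placeholder for real scraping logic (hypothetical result shape).
    return {"title": title, "year": None}


def serve(requests, responses):
    """Answer one JSON request per line until a 'done' line or end of input."""
    for line in requests:
        line = line.strip()
        if not line:
            continue
        if line == "done":
            break
        request = json.loads(line)
        responses.write(json.dumps(lookup(request["title"])) + "\n")
        responses.flush()  # the caller may be waiting on each reply


# A real scraper process would run: serve(sys.stdin, sys.stdout)
# Here in-memory streams stand in for the pipes.
demo_in = io.StringIO(
    '{"title": "Example Movie"}\n{"title": "Another"}\ndone\n{"title": "ignored"}\n'
)
demo_out = io.StringIO()
serve(demo_in, demo_out)
```

Because the process stays alive between requests, any state it builds up (caches, sessions) persists until "done" arrives or the pipe is closed.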

By the way, I was looking and didn't find an answer for this either: why embed Python? I don't mean why Python in particular, which is an admirable language, just why embed it rather than using some sort of arm's-length language-agnostic protocol or providing a library interface, starting with Python but extensible by others to Perl, Ruby, etc., even compiled languages. Is this also in the area of "Sounds good, let's see some code" or are there more fundamental issues I'm missing?

Quote:Note that things like identification of tvshows vs movies can be done using the scraper language anyway I should think (if they're marked, then a regexp can likely figure that out and demote them.)

As for selecting alternate posters or among matched items - this is something that requires user interaction anyway - in this case we should choose the "best" guess and it's up to the user to correct it via Movie Information.

There seems to only be a choice between "Current thumb" and "Remote thumb" - is there a way to select from a set?
takoi Offline
Fan
Posts: 606
Joined: Oct 2009
Reputation: 7
Location: Norway
Post: #6
Advantage: you can pass variables to the regex, which you can't do in scraper XML, I think.
jmarshall Offline
Team-XBMC Developer
Posts: 25,683
Joined: Oct 2003
Reputation: 169
Post: #7
Reason it's embedded is mainly historical - we didn't have a choice on the xbox (one process at a time).

We certainly want to tidy up the python API - there's a bunch of legacy stuff that really shouldn't be used (eg python directly manipulating the UI, injecting controls etc.) as it's quite broken, and not really all that useful nowadays.

So yes, it's very much a "Sounds good, let's see some code" (or in this case, some sort of design spec first would be nice!) :)

Cheers,
Jonathan
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #8
you can have any number of remote thumbs - a fact we certainly take advantage of...

ventech, please, no guesses. sure you can pass info to the scraper, heck. how do you think we perform searches et al?
takoi Offline
Fan
Posts: 606
Joined: Oct 2009
Reputation: 7
Location: Norway
Post: #9
I'm sorry then; documentation is the advantage.
brulsmurf Offline
Member
Posts: 93
Joined: Oct 2008
Reputation: 0
Post: #10
I know of at least one site (MovieMeter.nl) that provides an XML-RPC API for retrieving movie info. It cannot be used with the current scraper technology (I checked), but it can be used with a Python scraper.
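
For a sense of what this looks like from Python, here is a small sketch using the standard library's `xmlrpc.client`. The method name `example.search` and the placeholder URL are invented for illustration; they are not MovieMeter.nl's actual API. The sketch only builds and re-parses the request payload, so it makes no network call.

```python
import xmlrpc.client

# Hypothetical method name and argument, NOT the real MovieMeter.nl API.
request_xml = xmlrpc.client.dumps(("Example Movie",), methodname="example.search")

# Parse it back to show what the server would receive.
params, method = xmlrpc.client.loads(request_xml)

# A live call would look roughly like this (the URL is a placeholder):
# proxy = xmlrpc.client.ServerProxy("https://example.invalid/xmlrpc")
# results = proxy.example.search("Example Movie")
```

The point is that the request and the response are structured data handled by a library, rather than text to be picked apart with regular expressions.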

My skills are way too limited to add a Python scraper feature to XBMC, but they just might be good enough to create a MovieMeter.nl scraper once such a feature is available :)
vikjon0 Offline
---
Posts: 2,463
Joined: Apr 2009
Reputation: 7
Location: Sweden
Post: #11
Are you considering this? (I could be totally off here since I don't understand the existing scrapers.)

I just think it would be nice if the scraper had a Python wrapper around the regex where you could add some post- and possibly pre-processing.

Like: Oh, you didn't find it, then we do this.

This was actually what I expected when I first looked inside, believing I could customize it.
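
A minimal sketch of that kind of wrapper, assuming the existing regex-based lookup could be called as a plain function. `fake_search` is a hypothetical stand-in for it, and the cleanup rules are only examples.

```python
import re


def clean_title(raw):
    """Pre-handling: turn dots/underscores into spaces and drop a trailing year."""
    title = re.sub(r"[._]+", " ", raw)
    title = re.sub(r"\s*\(?\b(19|20)\d{2}\b\)?\s*$", "", title)
    return title.strip()


def scrape(raw_name, search):
    """Try the raw name first; if nothing is found, retry with a cleaned title."""
    results = search(raw_name)
    if not results:
        # "Oh, you didn't find it, then we do this."
        results = search(clean_title(raw_name))
    # Post-handling: drop duplicates while keeping the original order.
    return list(dict.fromkeys(results))


def fake_search(query):
    # Hypothetical stand-in for the existing regex-based scraper lookup.
    catalogue = {"Some Movie": ["Some Movie (2009)", "Some Movie (2009)"]}
    return catalogue.get(query, [])


found = scrape("Some.Movie.2009", fake_search)
```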
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #12
it's one of those things that would be nice if somebody did, and very welcome, yet nothing we want to focus on since what we have works well.
vikjon0 Offline
---
Posts: 2,463
Joined: Apr 2009
Reputation: 7
Location: Sweden
Post: #13
Thanks. Then I know. (Not a realistic project for me.)

In my case an alternative would be to improve the sorttv script to be more creative in analysing the tvdb and make sure we get a good file name before the scraping.