Scraper abstraction ideas and code
#1
I mentioned a little while ago that I wanted to be able to write Python scrapers. Toward that end, I've been doing some work on cleaning up the general CScraper interface so it's less dependent on having XML scrapers. With my changes, CScraper::Run is replaced by a set of functions that return higher-level objects. (Trac ticket #10078.)

For example, OLD: a client that wants to match an album calls scraper.Run("CreateAlbumSearchUrl") with the album/artist name, parses the returned URL, and passes it to scraper.Run("GetAlbumSearchResults"). That returns XML with titles and URLs, which the caller parses into CMusicAlbumInfo objects and presents to the user for selection. The client then calls scraper.Run("GetAlbumDetails") with the user's selection and parses the returned XML into a CAlbum object.

NEW: clients call scraper.FindAlbum with the album/artist name, which returns a vector of CMusicAlbumInfo objects for the user to select from. The CScraperUrl of the selected item is passed to scraper.GetAlbumDetails and a CAlbum is returned.

The new implementation of CScraper::FindAlbum calls both the CreateAlbumSearchUrl and GetAlbumSearchResults scraper functions, but it wouldn't have to for, say, a Python scraper, which could do its own fetching via callbacks. Also, a Python scraper would more naturally return a list of objects (or dicts) rather than XML. The next step to accomplish this is to either make the CScraper media functions pure virtual and create various overrides, or use a bridge pattern. I'd also be interested in working with jfcarroll's multi-language addon code if possible, so that other languages can be used.
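To illustrate the "pure virtual plus overrides" option, here is a minimal Python sketch. Every name in it (AlbumScraper, AlbumCandidate, CallbackScraper, the fetch callback, the toy "title|url" response format) is a hypothetical illustration, not an actual XBMC class or API; in C++ the abstract methods would correspond to pure virtual functions on CScraper.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import quote_plus


@dataclass
class AlbumCandidate:
    """Hypothetical stand-in for CMusicAlbumInfo: one search hit."""
    title: str
    artist: str
    url: str


class AlbumScraper(ABC):
    """Hypothetical interface; abstract methods play the role of pure virtuals."""

    @abstractmethod
    def find_album(self, album: str, artist: str) -> List[AlbumCandidate]:
        ...

    @abstractmethod
    def get_album_details(self, url: str) -> dict:
        ...


class CallbackScraper(AlbumScraper):
    """A scraper that fetches pages itself through a host-supplied callback."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch  # assumed: takes a URL, returns the page body

    def find_album(self, album, artist):
        body = self._fetch("http://example.com/search?album=" + quote_plus(album))
        hits = []
        # Assumed toy response format: one "title|url" pair per line.
        for line in body.splitlines():
            title, url = line.split("|", 1)
            hits.append(AlbumCandidate(title, artist, url))
        return hits

    def get_album_details(self, url):
        # Returns a plain dict rather than XML, as a Python scraper naturally would.
        return {"url": url, "review": self._fetch(url)}
```

The caller only sees find_album and get_album_details, so an XML-backed implementation and a Python-backed one could sit behind the same interface.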

One question that I ran into was testing. At present there don't seem to be any automated scraper tests. I would like to build a (Python) framework for such testing (and perhaps integrate it into 'make test' later), but then I have two problems: (1) it's hard to run scrapers if XBMC isn't running, and (2) insufficient test data.

For the first problem, I would like to move the scraper code (everything in the modified CScraper "and below") into a library (probably dynamic, but static would be fine too). This will allow a test harness (and other clients) to load it and pass it data. There appears to be a lot of dynamic library code in the source tree - presumably I'd want to create a wrapper class inheriting from DllDynamic and load it with one of DllLoader/SectionLoader/DllContainerLoader at an appropriate point.

For the second, I hope I can ask for and get test data from people, when I'm ready for it - that would mean directory listings (no actual media files) and matching music/video database dumps - and pull together the results of successful scrapes into test data, which I can run against the current XBMC, XBMC with my modifications, and any Python conversions of existing XML scrapers. I would also fetch and save a copy of the web pages used for testing as part of the test data since (1) web sites change and (2) it's not good citizenship to hit a search site too many times just to run tests (of course, some optional live tests would be a good idea).
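The "fetch and save a copy of the web pages" idea could look roughly like the sketch below: a fetcher that replays saved pages from a fixture directory and only goes to the network (through an optional callback) when a page has not been recorded yet. ReplayFetcher, its parameters, and the hashed file naming are all assumptions made for illustration, not part of any existing XBMC test harness.

```python
import hashlib
from pathlib import Path


class ReplayFetcher:
    """Serve saved pages from disk instead of hitting the live site."""

    def __init__(self, fixture_dir, live_fetch=None):
        self.fixture_dir = Path(fixture_dir)
        # Optional callable(url) -> str, used only to record missing pages.
        self.live_fetch = live_fetch

    def _path_for(self, url):
        # Hash the URL so that any URL maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.fixture_dir / (digest + ".html")

    def fetch(self, url):
        path = self._path_for(url)
        if path.exists():
            return path.read_text(encoding="utf-8")
        if self.live_fetch is None:
            raise KeyError("no saved page for " + url)
        body = self.live_fetch(url)  # record once, replay afterwards
        self.fixture_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(body, encoding="utf-8")
        return body
```

Constructed without live_fetch, the fetcher never touches the network, which suits the default test run; the optional live tests would pass a real fetch function.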

Something else to throw out: when Python scrapers are possible, and presuming I successfully convert the current set of XML scrapers to equally-efficient Python with the same results, is there a good reason to keep the XML scraper system around? One might be "It's easier to teach people XML than Python", but I don't think that's terribly valid as anyone that can learn the necessary details about XML, functions, variables, regexes, chaining, etc. to write an XML scraper should be able to adapt the necessary regexes to a Python template - only a subset of real programming would be required for most.
#2
Hi there

I'm currently working on reworking the scanning interface - it too is to be somewhat distinct from the rest of the XBMC codebase where possible so as to allow testing outside of the XBMC tree. I'm onto about design 9 at this point - lots of things we currently support that need to keep working plus lots of new things we want to be able to support. I'd very much like that to be language agnostic as well. This of course is pre-scraper (i.e. reading the filesystems, figuring out which files are media worth scanning, getting local information if available, and generating a set of search terms) but I think it makes sense to consider it all before diving in too deeply.

If you're interested I could email you my ideas and we could get together, perhaps with jfcarroll and any team members who are interested and hash some ideas out?

Cheers,
Jonathan
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


#3
jmarshall Wrote:I'm currently working on reworking the scanning interface - it too is to be somewhat distinct from the rest of the XBMC codebase where possible so as to allow testing outside of the XBMC tree. I'm onto about design 9 at this point - lots of things we currently support that need to keep working plus lots of new things we want to be able to support. I'd very much like that to be language agnostic as well. This of course is pre-scraper (i.e. reading the filesystems, figuring out which files are media worth scanning, getting local information if available, and generating a set of search terms) but I think it makes sense to consider it all before diving in too deeply.

If you're interested I could email you my ideas and we could get together, perhaps with jfcarroll and any team members who are interested and hash some ideas out?

Sure, that sounds great. Modularization should ideally be done the same way across various potential modules.
#4
i strongly object to removing the xml based scrapers. you will never ever be able to get the same performance from interpreted code (not that the performance will be a problem). plus why remove it? personally i only touch python if i absolutely have to.
#5
Latest fix killed compiling...

Code:
MusicInfoScraper.cpp:193: error: ‘ClearCache’ is not a member of ‘CScraperParser’
#6
my bad, my svn at work is a mess. fixed now, dbrobins was completely innocent.
#7
spiff Wrote:i strongly object to removing the xml based scrapers. you will never ever be able to get the same performance from interpreted code (not that the performance will be a problem). plus why remove it? personally i only touch python if i absolutely have to.

I did think of a reason that the XML scrapers are superior: security - they limit very carefully what can be done - although I would imagine Python has some way to also run in a sandbox, like Perl has the Safe module, and there are plenty of Python addons and plugins already. And it's true there's no need to remove it.

I would guess that the performance of the code is overshadowed by the network I/O time, and the Python module should only need to be compiled once (or once per use?) if .pyc files are working properly. If I may ask, why the dislike for Python? I find it quite elegant and develop a good number of tools in it, even though I used to prefer Perl for that.
#8
I don't think there's any real reason to remove working code unless there is another method that offers considerable advantages.

I also don't think that scraping in python would be any slower than doing it via the regexp scraping stuff - most of the time you're waiting on network I/O.

Agreed that python is quite a good language for doing this sort of thing - good regexps, XML parsing and so on. The HTTP fetching and caching should probably be done internally to XBMC, however, to ensure that it is done using the appropriate proxies and the like.

Another improvement that would greatly simplify (and robustify - a new word ;) ) thetvdb and themoviedb would be supporting XSLT transformation of the returned XML - after all, all we're doing is translating XML in a lot of those scrapers.

Cheers,
Jonathan
#9
jmarshall Wrote:Agreed that python is quite a good language for doing this sort of thing - good regexps, XML parsing and so on. The HTTP fetching and caching should probably be done internally to XBMC, however, to ensure that it is done using the appropriate proxies and the like.

I had thought that it would make sense to add an internal xbmc.scraper module which could be used to do URL fetches using the usual XBMC mechanism (and UI). This seemed kinder to the Python script writer, and more flexible, than doing the equivalent of what the XML scraper does, that is, having "get URL" and "get results" callbacks (which I understand makes sense for XML scrapers). It may amount to the same thing fetch-wise, but to me:

Code:
from xbmc.scraper import FetchURL
from urllib.parse import urlencode

def FindMovie(search):
  # urlencode takes a mapping, so the query is passed as a dict
  results = FetchURL("http://example.com/?" + urlencode({"q": search}), post=True)
  movies = []
  if results:
    pass  # parse returned HTML into movies list
  return movies

makes more sense and is easier to write than:

Code:
import xbmc.scraper
from urllib.parse import urlencode

def GetSearchUrl(search):
  # urlencode takes a mapping, so the query is passed as a dict
  query = urlencode({"q": search})
  return '<url post="true">http://example.com/?' + query + '</url>'
  # or even:
  # return xbmc.scraper.Url("http://example.com/?" + query, post=True)

def GetSearchResults(content):
  movies = []
  # parse HTML into movies list
  return movies

especially when the scraper wants to use an XMLRPC query, or read from a local or LAN database, etc.

Quote:Another improvement that would greatly simplify (and robustify - a new word ;) ) thetvdb and themoviedb would be supporting XSLT transformation of the returned XML - after all, all we're doing is translating XML in a lot of those scrapers.

Are you envisioning a new kind of scraper or a new feature for the XML scraper? An XSLT transform would be easy to do in a programming language - most have efficient modules for it - rather than bolting yet one more thing onto the XML scraper framework.
#10
Yup - that was exactly what I was thinking - a simple module to handle URL gets/posts etc.

The API for the scraper is basically just 2 functions:

1. Lookup using foo, where foo might be an object containing title, year and other possible things to match on. It returns a list of search results, with optionally some matching criteria (eg score between 0 and 1).

2. Get details for a particular item, possibly restricting the data to fetch in some way (I want plot from imdb, everything else from tmdb, plus a bunch of images from <insert other scraper>).
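As a rough sketch of that two-function API, assuming a toy in-memory catalogue in place of a real site: lookup returns scored search results, and get_details can be restricted to named fields. All names here (SearchResult, lookup, get_details, the fields parameter) and the data are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class SearchResult:
    title: str
    year: Optional[int]
    item_id: str
    score: float  # match quality between 0 and 1


# Toy in-memory catalogue standing in for a remote site (made-up data).
_CATALOGUE = {
    "m1": {"title": "Example Movie", "year": 2001, "plot": "A plot.", "rating": "7.0"},
    "m2": {"title": "Example Movie", "year": 2010, "plot": "A remake.", "rating": "5.5"},
}


def lookup(query: Dict[str, object]) -> List[SearchResult]:
    """Function 1: match on title, using the year (if given) to rank hits."""
    wanted_title = str(query.get("title", "")).lower()
    wanted_year = query.get("year")
    results = []
    for item_id, item in _CATALOGUE.items():
        if item["title"].lower() != wanted_title:
            continue
        # Full score when the year matches or was not specified, else half.
        score = 1.0 if wanted_year is None or wanted_year == item["year"] else 0.5
        results.append(SearchResult(item["title"], item["year"], item_id, score))
    return sorted(results, key=lambda r: r.score, reverse=True)


def get_details(item_id: str, fields: Optional[Iterable[str]] = None) -> dict:
    """Function 2: return details, optionally restricted to the named fields."""
    item = _CATALOGUE[item_id]
    if fields is None:
        return dict(item)
    return {name: item[name] for name in fields if name in item}
```

With the fields restriction, a caller could ask one scraper for only the plot and another for everything else, then merge the resulting dicts.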

As for XSLT, it could be done using either the XML scheme or any other scheme/language. If done using the XML scheme then we'd need an improved XML parser in XBMC, which wouldn't necessarily be a bad thing.

Cheers,
Jonathan
#11
I noticed the first 7 patches attached to ticket 10078 have been committed; how about the last 2, especially as #8 contains the main scraper abstraction change? Anything I can do to help get them in?
#12
iirc i had some concerns with one of them, and didn't have time to review the big one.
#13
spiff Wrote:iirc i had some concerns with one of them, and didn't have time to review the big one.

I understand we all have day jobs/school. Just so long as it's not abandoned. When you get to it, let me know your concerns and I'll update the patch(es) if needed.
#14
Great work dbrobins, I'd love to have Python scrapers. I don't think you need to convert every scraper yourself. As soon as it is possible, people will write Python scrapers as they need them.