[Proposal] Separating the scraper into a library.
#1
Hi all,

I'm a Belgian Computer Science student (1st Master) at the University of Ghent. I have participated successfully in GSoC for two years now and I really love the program. It allows me to code the entire summer without having to worry about making some money for next year, and I get the opportunity to get into new FOSS projects and to meet interesting people.

I've been using XBMC for almost 3 years now, I think, and I'm a bit obsessive about my media library, which led me to the IRC channel a few months ago. Asking around there, I concluded that moving the scraper support into a library would solve some of my problems. People told me the idea has been around for a while; I had a quick look at the code but didn't continue because I got some paid work. GSoC and the summer would be my opportunity to finally get into the XBMC code-base for real.

Proposal
The main idea is to separate the scraper support into a library. There are several sub-parts: the intelligent file iteration, the XML-scraper parsing, the callbacks that implement the XML scrapers' functionality, the acquiring of the data and finally storing the results.

My proposal will focus on the middle two parts, and if I have time left I will include the first. I would add some new features and options to the XML scrapers to make them even more powerful. For example, there is the idea of 2-way communication, allowing the scraper to provide feedback to a back-end (for example, the user's choices about which file is which movie could be fed back for better predictions, recommendation systems, ...). I would also like to see scraping become a lot more modular, so people can put together their own custom scraper from prefab blocks. Ratings, posters, fan-art, plot, actors, ... would all be small blocks which could get their data from different sources. I might like IMDb's ratings but the plot from RT and the posters from somewhere else, and there might be no existing scraper with that exact combination. Modular scraping would allow such user-defined combinations, and it would also allow rescraping only ratings, or only plots, and so on.
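To make the modular idea a bit more concrete, here is a rough python sketch of how such block composition could work. Nothing of this exists yet; every class and name below is invented purely for illustration.

Code:
# Illustration only -- all names are made up, none of this is existing XBMC code.
# Each "block" fetches one kind of metadata from one source; users combine
# blocks into their own custom scraper.

class RatingsFromIMDb:
    def scrape(self, title):
        # a real block would query IMDb here; stubbed out for the example
        return {"rating": 8.1}

class PlotFromRottenTomatoes:
    def scrape(self, title):
        return {"plot": "Plot summary fetched from RT would go here."}

class PostersFromSomewhereElse:
    def scrape(self, title):
        return {"posters": ["http://example.org/poster1.jpg"]}

class CompositeScraper:
    """Runs a user-chosen list of blocks and merges their results."""
    def __init__(self, blocks):
        self.blocks = blocks

    def scrape(self, title):
        result = {}
        for block in self.blocks:
            result.update(block.scrape(title))
        return result

# a user-defined combination: ratings from IMDb, plot from RT, posters elsewhere
my_scraper = CompositeScraper([RatingsFromIMDb(),
                               PlotFromRottenTomatoes(),
                               PostersFromSomewhereElse()])
print(my_scraper.scrape("Some Movie (2011)"))

Rescraping only ratings would then simply mean running the ratings block on its own.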

I have it worked out in a lot more detail and more technically, but I don't think that's interesting for everyone, so if you have any questions feel free to PM me, or PM me for my e-mail address so we can mail.

Benefits
There really are a huge number of benefits to having a library for this. I'll name some, but the list surely isn't exhaustive.
  • It will result in more modular, less interwoven code in XBMC's codebase.
  • Other applications (utility tools, ...) could use the library and its bindings to Python/Java/... to support the same XML scrapers and have the same scraper functionality as XBMC does. This would enable a more uniform way of scraping: tools like Ember would do things 100% the XBMC way, the resulting data would be 100% XBMC-compatible, and those tools would support our regular scrapers!
  • Scraping could be done off-site. Your NAS/file server could do the scraping when a new file comes in, even if your mediacenter isn't powered on (the data could go into the MySQL database if enabled, or be stored until the mediacenter PC is on). Nice browser libraries could be made showing your media library with all the metadata XBMC would have, without your system being on. There's a small sketch of this idea right after this list.
  • Modular scrapers have a lot of benefits, some of which were mentioned above and others I won't list here.
  • Two-way communication.
  • If I get to the file-iteration part, I would make it smart enough that XBMC would only need one path, without having to indicate what kind of data is there to find (movies, shows, music, ...).
  • The whole library idea fits perfectly into the unified back-end movement documented on the wiki.
  • ...
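To give an idea of the off-site scraping benefit from the list above, here's a minimal python sketch of how a NAS-side script might drive the library through its bindings. The "ScraperLibrary" API below does not exist; every name is just a placeholder for the kind of calls the library could offer.

Code:
# Hypothetical sketch -- stand-in for the (not yet existing) python bindings.

class ScraperLibrary:
    def identify(self, path):
        # decide what kind of media this is (movie, episode, music, ...)
        return {"path": path, "type": "movie", "title": "Some Movie"}

    def scrape(self, info):
        # run the configured scrapers for this item
        return {"title": info["title"], "rating": 8.1}

    def store(self, metadata):
        # write to the shared MySQL database if enabled, queue locally otherwise
        print("stored:", metadata)

def on_new_file(path, library):
    info = library.identify(path)
    metadata = library.scrape(info)
    library.store(metadata)

on_new_file("/mnt/media/Some.Movie.2011.mkv", ScraperLibrary())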

I hope devs (and users) who read this are as excited as I am about this project. Please feel free to ask any questions here, suggest extra features, or provide feedback. I only have my own experience with the media library, so there might be awesome features I just don't think of because I'd never use them.

Greets,
Sander
#2
Hey Sander,

One thing you'll want to include in your official proposal to GSoC is which projects you've been involved in in the past - if you have evidence of previous code done etc. that'd be great as well.

All the best!

Cheers,
Jonathan
#3
Hi,

Thanks a lot for your feedback, jmarshall! This post is a summary of what I want to do with XBMC this summer, in the hope of getting some feedback or extra feature requests. It's not my real proposal, which will be more detailed and technical and will contain the things you asked for too.

I'll try some more concrete questions for other devs and users, those should be easier to reply to I guess Smile

  1. Would you rather have a smart file scanner which needs only one path and detects whether a file is a show, a movie or music, or would you rather have the modular scraper idea implemented?
  2. Can you think of other cool stuff scrapers would be able to do with a feedback channel (2-way communication)?
  3. For all XML scraper developers out there: what features do you miss? Have you ever thought "if only I could..."?
  4. Developers of supplementary tools, do you have a vision of what such a library should provide? How would you use it?

Again, feel free to comment on this in any way you can think of.

Greets,
Sander
#4
Wasn't there a plan to switch scrapers from .xml to python addons ?
I think I heard that somewhere.
Admin @ Passion-XBMC
(official french community)
#5
(2012-03-25, 14:19)Maxoo Wrote: Wasn't there a plan to switch scrapers from .xml to python addons ?
I think I heard that somewhere.

dbrobins had patches around to add scraper abstraction (see for instance: here)
#6
(2012-03-25, 10:24)dzan Wrote:
  1. Would you rather have a smart file scanner which needs only one path and detects whether a file is a show, a movie or music, or would you rather have the modular scraper idea implemented?

I think a 'smart' file scanner would definitely benefit the average XBMC user the most. The learning barrier of setting up paths is too big a hurdle for some to make use of the library feature.
#7
(2012-03-25, 14:19)Maxoo Wrote: Wasn't there a plan to switch scrapers from .xml to python addons ?
I think I heard that somewhere.

It seems to me that the current XML scrapers work great and are very user-friendly. You don't need programming skills to create one, so why would one replace them with python scrapers? Performance is not an issue, since the XML only dictates to the back-end "where to look". But anyway, once I've created the library there would be python bindings for it, so if there is a need, this GSoC project would enable python scrapers anyway I guess.

(2012-03-25, 15:14)DonJ Wrote: dbrobins had patches around to add scraper abstraction (see for instance: here )

Thanks a lot for the link. I've read through it and will need to do so again, but after a first read I noticed some major differences. First of all, my primary goal is to create a totally separate library, not to improve the current architecture. The library would of course use the current architecture at first, and improved versions as I progress; dbrobins's work will provide a lot of information for this purpose.

(2012-03-25, 15:17)DonJ Wrote: I think a 'smart' file scanner would definitely benefit the average XBMC user the most. The learning barrier of setting up paths is too big a hurdle for some to make use of the library feature.

Thanks for this. If more people share your opinion and prefer this over extending scraper functionality, I will devote more time and work to this and less to modular scraping (if time becomes an issue, of course).
#8
(2012-03-25, 19:09)dzan Wrote:
(2012-03-25, 14:19)Maxoo Wrote: Wasn't there a plan to switch scrapers from .xml to python addons ?
I think I heard that somewhere.

It seems to me that the current XML scrapers work great and are very user-friendly. You don't need programming skills to create one, so why would one replace them with python scrapers? Performance is not an issue, since the XML only dictates to the back-end "where to look". But anyway, once I've created the library there would be python bindings for it, so if there is a need, this GSoC project would enable python scrapers anyway I guess.

Scrapers that need authentication, for example: with XML scrapers that means you send the login/pass without encryption. That's dangerous. I don't understand everything you propose, but maybe your 2-way communication could take care of that?

Anyway, I was just mentioning python scrapers because I thought there was some coding already done and I wouldn't want to see that wasted Wink
(plus it was relevant I think).

#9
I would love to see more sources than TVDB for American TV scrapers. It's dangerous to rely on only one resource. It means that when the information is incorrect and a series is locked (as is often the case), there is no way to correct it other than relying on their admins. We were also all left without access to a scraper when they were down for a few days last year.

I'd love to see the tvrage.com API be used as well to pull episode data. Or IMDB (although IMDB officially prohibits screen-scraping and directs people to download and use daily FTP data dumps).

The Sickbeard development people are working on something at http://thexem.de to crossmap information from different sources - TVDB, AniDB, TVRage, and the Scene. It would probably be wise to jump in and collaborate on their work too.
#10
(2012-03-25, 19:21)Maxoo Wrote: Scrapers that need authentication, for example: with XML scrapers that means you send the login/pass without encryption. That's dangerous. I don't understand everything you propose, but maybe your 2-way communication could take care of that?

Anyway, I was just mentioning python scrapers because I thought there was some coding already done and I wouldn't want to see that wasted Wink
(plus it was relevant I think).

Thanks, this is indeed another use of 2-way communication I hadn't thought of yet. I'll have a look at some of the common login schemes and try to abstract them enough so this could be done from an XML scraper.
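Roughly what I have in mind, as a python-level sketch (none of this exists, all names are invented): the scraper declares that it needs credentials for a site, and the feedback channel asks the front-end for them instead of the XML scraper embedding a password in plain text.

Code:
# Invented names, illustration only.

class CredentialProvider:
    """Implemented by the front-end (XBMC dialog, web UI, config file, ...)."""
    def get_credentials(self, site):
        # a real implementation would ask the user or read a secure store
        return ("someuser", "somepassword")

def login(site, provider):
    user, password = provider.get_credentials(site)
    # the library would perform the actual login here (over https where the
    # site supports it) instead of the scraper handling the password itself
    return {"site": site, "session": "fake-session-token-for-" + user}

print(login("example-tracker.org", CredentialProvider()))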

(2012-03-26, 05:15)ZebZ Wrote: I would love to see more sources than TVDB for American TV scrapers. It's dangerous to rely on only one resource. It means that when the information is incorrect and a series is locked (as is often the case), there is no way to correct it other than relying on their admins. We were also all left without access to a scraper when they were down for a few days last year.

Hey, thanks for your reply. I understand the problem, but this isn't really related to my project; you should find someone willing to make XML scrapers for those websites (unless that isn't possible with the current scraper support, in which case I'd love to know exactly what's missing).

(2012-03-26, 05:15)ZebZ Wrote: I'd love to see the tvrage.com API be used as well to pull episode data. Or IMDB (although IMDB officially prohibits screen-scraping and directs people to download and use daily FTP data dumps).

Same here, this is one layer "above" what I would implement.

(2012-03-26, 05:15)ZebZ Wrote: The Sickbeard development people are working on something at http://thexem.de to crossmap information from different sources - TVDB, AniDB, TVRage, and the Scene. It would probably be wise to jump in and collaborate on their work too.

This seems like a very useful project. Again, it's mostly up to scraper developers to use it, but it might come in handy when working on the modular scraping (I would have to ask around about depending on such a new initiative). I could write an optional library function accepting the show's name and the desired source, which would then return a URL from that source instead of having to search the source (e.g. TVDB) directly. The first 'block' of XML scrapers could then use it and wouldn't need to parse each specific website anymore... Good stuff to think about Smile
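Something along these lines, as a sketch (the function and the mapping data are made up; a real version would query thexem.de or a similar service):

Code:
# Hypothetical helper: show name + desired source in, URL on that source out.

FAKE_CROSSMAP = {
    ("Some Show", "tvdb"):   "http://thetvdb.com/?tab=series&id=12345",
    ("Some Show", "tvrage"): "http://www.tvrage.com/shows/id-6789",
}

def resolve_show(name, source):
    """Return a URL for `name` on `source`, or None if the mapping is unknown."""
    return FAKE_CROSSMAP.get((name, source))

print(resolve_show("Some Show", "tvrage"))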
#11
i have POC (non-published) code for moving scrapers to python. while i love my own brainchild (xml based scrapers), there are simply many more people out there capable of/interested in writing python based scrapers.

i use the current plugin mechanism, so lookups are done completely as vfs operations. this has several advantages, the most obvious one being that it's accessible to anything which can access our vfs, i.e. to python add-ons, to anything using json-rpc and so on. basically, it's simply queries of the form

plugin://<scraperid>?action=<list>&title=<name>

with some of the calls tailored for tvshows, movies, albums, artists etc.

with one of the "emulators" for our python bindings out there, this approach should also be possible to take outside xbmc. the only con is that it will tie the system somewhat to xbmc internals. personally i don't think this is a big problem, since it's rather general in nature (even though the classes are called xbmc.xxx)
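just to illustrate the shape of those queries (the scraper id below is only an example, and this little helper is not real code from my POC):

Code:
# building a lookup in the plugin://<scraperid>?action=<list>&title=<name> form

from urllib.parse import urlencode

def build_lookup(scraper_id, action, **params):
    query = urlencode(dict(action=action, **params))
    return "plugin://%s?%s" % (scraper_id, query)

print(build_lookup("metadata.themoviedb.org", "find", title="Some Movie"))
# -> plugin://metadata.themoviedb.org?action=find&title=Some+Movie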
#12
With regards to XBMC requiring only one path, I imagine there will be a lot of people who have a media collection spanning several volumes and / or drives; would this still allow for multiple sources?
#13
(2012-03-26, 17:43)jim0thy Wrote: With regards to XBMC requiring only one path, I imagine there will be a lot of people who have a media collection spanning several volumes and / or drives; would this still allow for multiple sources?

Sure, those people could just enter all those paths. The point isn't so much that the data is in one path; it's that you currently have to indicate what type of data it is and what scraper to use.
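To give an idea of what I mean by "smart", here is a very rough python sketch of the kind of heuristics such a scanner could start from (real detection would of course also use folder structure, NFO files, online lookups, and so on; the patterns below are just examples):

Code:
import re

MUSIC_EXT = {".mp3", ".flac", ".ogg", ".m4a"}
VIDEO_EXT = {".mkv", ".avi", ".mp4", ".m2ts"}
EPISODE_RE = re.compile(r"s\d{1,2}e\d{1,2}|\d{1,2}x\d{2}")  # S03E07 or 3x07

def guess_type(filename):
    name = filename.lower()
    ext = name[name.rfind("."):] if "." in name else ""
    if ext in MUSIC_EXT:
        return "music"
    if ext in VIDEO_EXT:
        return "tvshow" if EPISODE_RE.search(name) else "movie"
    return "unknown"

print(guess_type("Some.Show.S03E07.720p.mkv"))   # tvshow
print(guess_type("Some.Movie.2011.1080p.mkv"))   # movie
print(guess_type("01 - Some Track.flac"))        # music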



(2012-03-26, 14:57)spiff Wrote: i have POC (non-published) code for moving scrapers to python. while i love my own brainchild (xml based scrapers), there are simply much more human resources out there capable/interested in writing python based scrapers.
I'm sure you have a better feel for the community than I do, but I can say from my own experience that reading the wiki on XML scrapers once was enough to be able to create or adjust one... that's pretty user-friendly. And while there certainly are a lot of coders around here, not all of them know python.

Another advantage of the XML approach is the fact that it leaves less freedom. This may seem odd, but that way others can easily adapt existing scrapers and know how to read them, and it has some benefits for security (which isn't a real issue for XBMC, I admit).

I like the XML scrapers, but if the devs think they will be removed in the future, it might be better for me to spend more time on other parts of the project, of course.

(2012-03-26, 14:57)spiff Wrote: i use the current plugin mechanism, so lookups are done completely as vfs operations. this has several advantages, the most obvious one being that it's accessible to anything which can access our vfs, i.e. to python add-ons, to anything using json-rpc and so on. basically, it's simply queries of the form

plugin://<scraperid>?action=<list>&title=<name>

with some of the calls tailored for tvshows, movies, albums, artists etc.

with one of the "emulators" for our python bindings out there, this approach should also be possible to take outside xbmc. the only con is that it will tie the system somewhat to xbmc internals. personally i don't think this is a big problem, since it's rather general in nature (even though the classes are called xbmc.xxx)
I haven't seen your code changes of course, but if I understand correctly they don't remove the need for, or the advantages of, my proposal? XBMC would still benefit from a total separation of the scraper code into a library, both directly and through the extra options it would give developers of third-party software and such. It would also still help the client/back-end movement.

XBMC would then just use the library tied to your changes... am I right? I could start separating and adding extra functionality for scrapers, but hold off on including XML-scraper support in the library until a decision is made? The callbacks and classes implementing the functionality would be made stand-alone anyway, and I could spend more time on the file iteration/detection while waiting for the decision.

Thanks for your feedback and please continue to provide it Big Grin
#14
how we obtain the info is, indeed, completely orthogonal to your ideas. which is a hint that supporting both should be doable under one (abstracted) interface.

i can, and will, be much more verbose at a later stage, but right now, it's important that we mentors don't detail the tasks too much. a part of the gsoc idea is that the ideas should come from you Smile
#15
(2012-03-26, 18:06)dzan Wrote: Sure, those people could just enter all those paths. The point isn't so much that the data is in one path; it's that you currently have to indicate what type of data it is and what scraper to use.

Just thought I'd check. Smile You're right that it can be quite confusing for non-technically minded people to get their sources configured correctly.

What would be nice is if XBMC could return, via an API call, a list of all the parameters it requires from a scraper. That way a scraper-building front-end could be developed that would allow the end-user to provide the path to the data source's API and then match the returned values to the relevant XBMC fields. The aim would be to automatically generate the scraper code for new data sources without the end-user having to get their hands dirty with the code.

Returning the parameters via an API call would allow the front-end to keep itself in sync with any changes to the XBMC library fields.
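Something like this is what I have in mind, completely made up just to show the idea (no such API call exists today):

Code:
# Hypothetical: what a call listing the fields XBMC expects from a scraper
# might return, and how a scraper-builder front-end could use it.

def get_scraper_fields():
    # in reality this would come from XBMC itself, e.g. over JSON-RPC
    return [
        {"name": "title",  "type": "string", "required": True},
        {"name": "plot",   "type": "string", "required": False},
        {"name": "rating", "type": "float",  "required": False},
        {"name": "genre",  "type": "list",   "required": False},
    ]

# the front-end lets the user map each XBMC field to a field of the source's API
user_mapping = {"title": "movie_name", "plot": "synopsis", "rating": "score"}

for field in get_scraper_fields():
    print("%-7s <- %s" % (field["name"], user_mapping.get(field["name"], "<not mapped>")))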
