As a first step I've created a script that gathers scraped data; I hope it will be released into the official repository soon. The data it generates will be used to a) give us an idea of how well the old engine works and b) provide data to train a new engine on. All data will be anonymous, and I will write a blog post about it when it hits the official repository.
So to highlight what I want to achieve with a new engine, here is what I think is important:
- It must be generalized: adding a new media type should be trivial, and no part of the core should be bound to specific media types.
- Which fields are of interest is not tied to the engine; any scraper may add metadata as it sees fit. Consumers of the engine (XBMC and skinners) may choose which data they understand, but scrapers can emit any data they want.
- Parallelism! As much as possible should be parallelism friendly, ideally not only across files but across all parts of scraping a single file too.
- Everything is linked: a movie can have a soundtrack and a game associated with it; the director of a movie can be the singer in a band and share photographs on a site.
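To make the first, second, and last points a bit more concrete, here is a minimal sketch of what a media-type-agnostic data model could look like. All names here are hypothetical illustrations of mine, not the actual engine's design: the core would only know about items, free-form metadata fields, and links between items, with media types treated as opaque labels.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """A scraped entity; the core never interprets media_type."""
    media_type: str                               # e.g. "movie", "album", "person"
    fields: dict = field(default_factory=dict)    # scrapers add whatever metadata they like
    links: list = field(default_factory=list)     # (relation, Item) pairs: everything is linked

# A scraper could emit a movie, its soundtrack, and a director,
# linked together rather than forced into one fixed schema:
movie = Item("movie", {"title": "Some Film", "year": 1994})
album = Item("album", {"title": "Some Film (OST)"})
director = Item("person", {"name": "Jane Doe"})

movie.links.append(("soundtrack", album))
movie.links.append(("director", director))
```

A consumer such as a skin would then pick out only the fields and relations it understands and simply ignore the rest, which is what keeps the core free of media-type knowledge.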
I will go over my plan here in more detail later, but first I'd love to hear from all current scraper creators which features they would like to see, and perhaps even more importantly, which features of the old system they like.
I hope it's going to be a great summer!