2007-10-13, 16:05
Hi everybody,
I have been thinking about developing a universal media content scraper plugin: a general framework that uses regular expressions to extract categories and media links from multimedia websites. XOT-Uzg gave me the idea. The plugin should have an architecture similar to the imdb and tv.com scrapers for movie information: an XML file would carry all the regular expressions needed to extract media links from the HTML. The plugin would list every installed XML definition file in its base directory, so with TVLinks.xml and Joox.xml installed it would show TVLinks and Joox. When the user browses into one of those entries, the plugin opens the website, runs the regexp searches from the XML file and displays the resulting media/directory links.
What do you all think of this idea? Is an XML regexp format really a good approach, or would you prefer a framework with Python plugins like in XOT-Uzg? Would XML be too slow for a plugin? I would like this thread to be a collective brainstorming of ideas for the framework. Any suggestions?
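To make the XML idea a bit more concrete, here is a rough sketch of how it could work. Nothing here is a fixed format: the element names (`scraper`, `url`, `rule`), the `type` attribute, the named groups `url` and `title`, and the example site are all made up for illustration. It is written for current Python, using only the standard library.

```python
import os
import re
import xml.etree.ElementTree as ET

# Hypothetical scraper definition. Every tag and attribute name is an assumption.
SCRAPER_XML = """
<scraper name="ExampleSite">
  <url>http://example.com/videos</url>
  <rule type="category"><![CDATA[<a class="cat" href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>]]></rule>
  <rule type="media"><![CDATA[<a class="vid" href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>]]></rule>
</scraper>
"""

def scraper_names(filenames):
    """Turn a directory listing into the names shown to the user (TVLinks.xml -> TVLinks)."""
    return sorted(os.path.splitext(f)[0] for f in filenames if f.lower().endswith(".xml"))

def load_rules(xml_text):
    """Parse one scraper definition into (name, [(rule type, compiled regex), ...])."""
    root = ET.fromstring(xml_text)
    rules = [(r.get("type"), re.compile(r.text.strip())) for r in root.findall("rule")]
    return root.get("name"), rules

def apply_rules(rules, html):
    """Run every rule over the page and collect the matches as simple dicts."""
    items = []
    for kind, pattern in rules:
        for match in pattern.finditer(html):
            items.append({"type": kind, "title": match.group("title"), "url": match.group("url")})
    return items
```

In the plugin, `scraper_names(os.listdir(base_dir))` would produce the top-level listing, and `apply_rules` would run on the downloaded page when the user opens an entry. Parsing the XML and compiling a handful of regexps is cheap compared to the HTTP request, so I would not expect XML itself to be the bottleneck, but that is a guess and not something I have measured.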
The plugin should:
- Extract media URLs (video and audio)
- Extract browsable categories
- Extract image information for the items
- Let developers write media scrapers quickly, without all the Python headaches. Basic regexp skills should be enough.
- Have a repository of scraper XMLs, so people can install them automatically via a script.
- Make user interaction possible (for example, a virtual keyboard to enter a search term for YouTube)
- Have an icon embedded as base64 in the XMLs for each scraper
- Set the user agent for urllib2 (some sites like joox.net block the urllib user agent)
- Let very complicated scrapers fall back on Python scripts. This seems to be necessary for TVLinks, because the site uses a particularly nasty kind of URL hiding (see the code of the TVLinks plugin). Or is there a way to express such tasks with an XML rule, without Python?
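Two of the smaller points in the list (the user agent and the embedded icon) could look roughly like this. It is only a sketch under a few assumptions: it uses `urllib.request`, the Python 3 successor of urllib2; the user agent string is an arbitrary example; and the `<icon>` element name is hypothetical.

```python
import base64
import urllib.request  # Python 3 successor of urllib2
import xml.etree.ElementTree as ET

# Example value only; any browser-like string would do.
USER_AGENT = "Mozilla/5.0 (compatible; MediaScraper/0.1)"

def build_request(url, user_agent=USER_AGENT):
    """Create a request that sends our own User-Agent instead of the library default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url):
    """Download a page as text using the custom User-Agent."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def read_icon(xml_text):
    """Return the decoded icon bytes from a hypothetical <icon> element, or None if absent."""
    node = ET.fromstring(xml_text).find("icon")
    if node is None or not node.text:
        return None
    return base64.b64decode(node.text.strip())
```

The user agent could just as well be a per-scraper setting in the XML, so that each site definition can choose its own value.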