[WIP] AniDB.net Anime Video Scraper - Printable Version
+- XBMC Community Forum (http://forum.xbmc.org)
+-- Forum: Help and Support (/forumdisplay.php?fid=33)
+--- Forum: Add-ons Help and Support (/forumdisplay.php?fid=27)
+---- Forum: Metadata scrapers (/forumdisplay.php?fid=147)
+---- Thread: [WIP] AniDB.net Anime Video Scraper (/showthread.php?tid=64587)
- eldon - 2010-01-04 20:28
just realised tvdb has an xml api so i'm making changes to use it...
While doing that i noticed there are additional covers per season. i'm only using single-season anime for my testing, and it looks like thumbs with a season attribute, whatever it is set to, will not be listed in the "get thumb" info dialog. Should thumb season attributes be used or avoided, and how do you set the anime season property?
i'm wondering if the principle of seasons really applies to anime anyway. From what i know most anime have only one season, and when a series returns either the title changes or the episode numbering keeps growing (ex: Ghost in the Shell SAC, SAC 2nd GIG, or Naruto ep 1-220).
Can any anime expert comment on that and let me know if i should dig deeper into it?
As a side note, more of a feature request, i was wondering if there was any use for thumbs of the characters?
They are usually available on anidb and they would look quite nice in the cast info dialog. They could simply be substituted for the actor thumbs, maybe according to a skin parameter like "Use character thumbs".
Finally, i should point out it would be nice if xbmc were a bit more relaxed about episode filenames: if a file has the title in it and a number that could plausibly be an episode number (1 2 3 or 01 02 03..), it should be matched. You can't always rename files as you'd like, for example with torrents you usually can't rename the downloaded files, so the strict requirement of s01e01 naming seems a bit harsh.
- Ommina - 2010-01-07 00:43
OK. As has already been mentioned in this very old thread, AniDB asks that you do not scrape their pages.
Again, to be clear, please do not scrape AniDB pages.
I appreciate that the xbmc framework provides convenient methods for doing so. I also understand (given an entire forum dedicated to scraper development) that it is a very much accepted method for gathering data. Just the same, please do not scrape AniDB data.
AniDB does offer two public APIs for retrieving data, the complexity of which varies depending on the richness of the data you wish to retrieve.
For your purposes here, the HTTP XML API probably provides what you want, including complete episode lists. To help find the correct aid for a given anime title, a complete dump of all anime titles, across all languages, is also available. This is updated daily.
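For example, the details request and the titles dump might be referenced like this in a scraper (the client name "myclient" is a placeholder, you must register your own, and the endpoint and version numbers here should be verified against the API wiki page):

```xml
<!-- illustrative only: register your own client name instead of "myclient",
     and verify the endpoint/version against the API wiki page -->
<url cache="anidb-979.xml">http://api.anidb.net:9001/httpapi?request=anime&amp;client=myclient&amp;clientver=1&amp;protover=1&amp;aid=979</url>

<!-- daily gzipped dump of all titles across languages -->
<url cache="anime-titles.xml">http://anidb.net/api/anime-titles.xml.gz</url>
```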
If you're anxious to provide something especially in-depth, such as the MediaPortal plugin linked in the previous thread, the UDP API has been around for years.
Yes, I know that neither of these solutions is as convenient as using the scraping framework. However, they do both offer solutions that will not suddenly break one day, when AniDB makes internal changes. And there's a lot to be said for that.
- Ommina [AniDB]
- spiff - 2010-01-07 12:39
hurrah for you adding an http xml api! the udp api is evil and why i abandoned your site many moons ago (plus the fact that nobody responded to my http/xml api inquiries). with that in place, all i can promise is that a scraper not using your framework will never hit svn.
- eldon - 2010-01-07 12:46
well okay i guess that settles it then..
i can certainly use the xml http api, but i don't know about client-side caching. Scraper caching seems to last only for the duration of a single update run.
On the wiki page it looks like the studio and cast are missing from the api description. That's not too bad, although xbmc can use cast names to cross-reference library entries for a specific person, which is quite nice.
Maybe someone can elaborate on a "time to live" based cache for scrapers?
But i fear that's only something that could be done through scripts or plugins.
Downloading the anime titles database would almost certainly be required in order to perform searches for aids.. or we could go through google as a fallback, of course.
Although cache purging doesn't seem to be handled by the scraper itself, at least from what i saw in ScraperUrl.cpp, maybe the purge could honor some "time to live" so the files aren't removed right after the scraping is done.
We could do something like <url cache="file.xml" ttl="24">http://..</url>, with ttl set in hours. Any devs around?
Also, make sure you read about the ban rules for the http xml api; they would definitely require a more advanced caching mechanism on xbmc's side.
And if you want me to do the xml scraper i'm fine with it, i'm already halfway there anyway.
- spiff - 2010-01-07 12:58
yeah, i saw that. i will try to cough up a better system for a more persistent scraper cache. let me worry about that bit, you just make sure you use cache files in your scraper wherever appropriate.
- spiff - 2010-01-08 00:39
http://forum.xbmc.org/member.php?find=lastposter&t=50055 delivered on my part
- eldon - 2010-01-09 11:38
delivered what? sorry, i don't quite understand what the link points to..
Anyway, i've got a bit of a problem: how can i get the current title being searched (passed by the search dialog, from the filename or manual input) when i'm in GetSearchResults, or in any other part of the scraper for that matter?
i'm also having some trouble parsing the large anidb anime titles xml file. i'm looking for a way to iterate through the <anime></anime> blocks to look for the search pattern, but i can't find a way to do that.
the anidb anime titles data is as follows:
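A representative excerpt, reconstructed from the dump's documented structure (the aid and titles here are just examples):

```xml
<animetitles>
  <anime aid="1">
    <title xml:lang="x-jat" type="main">Seikai no Monshou</title>
    <title xml:lang="en" type="official">Crest of the Stars</title>
  </anime>
  <!-- ...one <anime> block per series, thousands in total... -->
</animetitles>
```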
i've got a working regex but it takes ages to run on that large file; i'd like to improve that by running a simpler regex over each single block.
What would be the correct <RegExp><expression> structure to perform such a task?
- eldon - 2010-01-13 15:21
i have completed the scraper but am still not able to use the anidb titles database, as explained above.
i'm currently using google to avoid scraping any anidb.net pages, but if someone can help me build a correct <RegExp><expression> structure, it would make the whole scraper anidb-compliant.
Or let me know if it can't be done and i'll post the scraper in its current state.
- spiff - 2010-01-13 15:22
i don't understand what you are trying to extract from those blocks.
- eldon - 2010-01-14 01:29
i'm simply trying to find the anime id based on the title xbmc extracted from the filename or directory name.
the match should be based on those "main" title lines:
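That is, lines of this shape (the title shown is just an example):

```xml
<title xml:lang="x-jat" type="main">Koukaku Kidoutai Stand Alone Complex</title>
```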
when you run a regex scanning the whole database file (2.2MB) it takes approx 30 sec to 1 min to find a match, and probably a lot more if the entry is located further in, as i didn't use "repeat" for my test.
so i'd like to break the database into <anime></anime> blocks and try a title match on each of them. i do know how to assign a matching block to a buffer and then use that buffer for further parsing, but i haven't found a way to repeat that over all the blocks.
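For what it's worth, here is roughly the structure i have in mind. This is only a sketch: the buffer numbers and the hard-coded "naruto" search term are placeholders, and whether repeat/append really behaves this way here is exactly what i'm unsure about:

```xml
<!-- sketch only. pass 1: repeat="yes" re-runs the expression over the
     input, and dest="2+" appends every matched <anime> block to buffer 2.
     pass 2: a cheap per-block test for a type="main" title containing
     the search term ("naruto" stands in for the real search string). -->
<RegExp input="$$1" output="\1" dest="2+">
  <expression repeat="yes">(&lt;anime aid=&quot;[0-9]+&quot;&gt;.*?&lt;/anime&gt;)</expression>
</RegExp>
<RegExp input="$$2" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;id&gt;\1&lt;/id&gt;&lt;/entity&gt;" dest="3">
  <expression>&lt;anime aid=&quot;([0-9]+)&quot;&gt;.*?type=&quot;main&quot;&gt;([^&lt;]*naruto[^&lt;]*)&lt;/title&gt;</expression>
</RegExp>
```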
But frankly, even with such a method i'm not really sure it'll be much faster, and if not, that would make the anidb titles database quite useless, or maybe optional only for those with serious muscle under their xbmc hood.
Google anidb title search works fine, but anidb's indexed page titles don't always contain the actual anime title, only "AniDB.net".