Anime News Network Scraper (Release?)

volforto Offline
Junior Member
Posts: 17
Joined: Jan 2010
Reputation: 5
Post: #1
I have been working on a scraper for Anime News Network (ANN). Initially I was going to use Google for searching, since ANN already uses it for its own search. However, I learned from this bing thread that Google does not allow scraping, so I am using Bing instead, with the AppID from that same thread. I am not sure what ANN's policy on scraping is, but it seems they don't mind (judging from this thread 5 years ago).

TV Shows: v0.46 download (xml+jpg)
Movies: v0.12 download (xml+jpg)

Settings for TV Shows scraper:

Enable All Language Casts
Retrieve other language voice actors in addition to Japanese.

Enable Unlisted Specials / 1 Episode OVA Workaround
Allow as many special episodes as there are normal season 1 episodes. So if the series has 26 normal episodes, you can also include 26 specials (season 0). These episodes will not have a title; they will be named "Special Episode" with a "Special" air date. This workaround also allows you to include an OVA with a single episode (e.g. Hoshi no Koe), for which ANN does not provide an episode listing. Just name it 0x01.

Enable TVDB Fanart
Retrieve fanart from TVDB using the main title from ANN. Matches with the same premiere date as the one listed on ANN are preferred.
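
The date preference can be sketched roughly as follows. This is only an illustration in Python, not the scraper's actual code (the scraper itself is XML and regular expressions); the function name, the dictionary keys and the sample data are all made up for the example.

```python
from datetime import date

def pick_fanart_match(candidates, ann_premiere):
    """Return the preferred search result, or None if there are none.

    candidates: list of dicts with the hypothetical keys "title" and "first_aired".
    ann_premiere: the premiere date listed on ANN.
    """
    if not candidates:
        return None
    # Results whose premiere date equals the ANN date are preferred.
    same_date = [c for c in candidates if c["first_aired"] == ann_premiere]
    # Otherwise fall back to the first result (an assumption for this sketch).
    return same_date[0] if same_date else candidates[0]

# Made-up search results for two series sharing a title.
results = [
    {"title": "Example Show (2005)", "first_aired": date(2005, 4, 1)},
    {"title": "Example Show", "first_aired": date(2009, 10, 3)},
]
best = pick_fanart_match(results, date(2009, 10, 3))
```

Here `best` is the second result, because its premiere date matches the one passed in as the ANN date.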

Include Alternative Titles in Fanart Search
In addition to the main title, also search with all the alternative titles listed on ANN.

Enable TVDB Banner (With ANN Thumbnail Fallback)
Get banners from TVDB using the main title from ANN, falling back to the ANN thumbnail if none is found.

Enable TVDB Poster (With ANN Thumbnail Fallback)
Or get posters instead, with the same fallback.

Enable TVDB Episode Details (Using Episode Title Matching)
Retrieve episode overviews and other details from TVDB. Matching is done by comparing episode titles rather than episode numbers.
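
A rough Python sketch of title-based matching, for illustration only. The normalization step (lowercasing and dropping punctuation) is an assumption of this example, and the function names and sample episodes are hypothetical; the real scraper does its comparison in scraper XML.

```python
def normalize(title):
    # Lowercase and keep only letters and digits, so that differences in
    # punctuation or capitalization do not prevent a match (an assumption).
    return "".join(ch for ch in title.lower() if ch.isalnum())

def match_episodes_by_title(ann_episodes, tvdb_episodes):
    """Map ANN episode numbers to the TVDB record with the same normalized title.

    ann_episodes: dict of episode number -> title (as listed on ANN).
    tvdb_episodes: list of dicts with the hypothetical keys "title" and "overview".
    """
    tvdb_by_title = {normalize(ep["title"]): ep for ep in tvdb_episodes}
    matches = {}
    for number, title in ann_episodes.items():
        hit = tvdb_by_title.get(normalize(title))
        if hit is not None:
            matches[number] = hit
    return matches

# Made-up data: the TVDB order differs from ANN's, and one episode is missing.
ann = {1: "The Beginning!", 2: "A New Friend", 3: "Farewell"}
tvdb = [
    {"title": "A new friend", "overview": "second"},
    {"title": "The Beginning", "overview": "first"},
]
matched = match_episodes_by_title(ann, tvdb)
```

Episodes 1 and 2 are matched despite the different ordering, while episode 3 has no TVDB counterpart and is simply left without details.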

The Movies scraper has some of the same settings, using TMDB instead of TVDB, but TMDB's search doesn't work as well, so the fanart search will fail more often.

UPDATES:
2010/03/26:
TV: Adapted scraper to ANN's new html.
TV: Fixed a bug where the scraper tries to get fanart no matter the setting.
MOVIE: Fixed a small bug introduced in v0.11.
2010/03/17:
Recovered the missing information from the old forum backup.
-----------
2010/02/08 - 2010/03/16:
Bunch of changes during this period.
-----------
2010/02/08:
Changed method for ANN thumbnail fallback (to fix a possible bug)
Changed fanart code to include fanarts from all alternative names, instead of just the first one with fanarts
Fixed some exception cases for the voice actors scraping
2010/02/04:
Updated scraper to work with ANN's new html for the casts section
(This post was last modified: 2010-03-27 05:49 by volforto.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #2
:)

maybe we can finally shut you anime fans up ;P

as for 2), i don't see how a bunch of functions would be required. this should do it

Code:
<GetSearchResults>
<url cache="something">..</url>
        ^^ to avoid fetching the same page more than once
<GetDetails>
...
<url function="gettvdbthumb">searchforthumb</url>
</GetDetails>

<gettvdbthumb>
  <RegExp ..><expression>matchthethumb?</expression></RegExp>
  conditionally push <url function="getannthumb" cache="something">somerandomcrap,cachewilloverride</url>
</gettvdbthumb>

<getannthumb>
<details><thumb>..</thumb></details>
</getannthumb>

get my drift?
(This post was last modified: 2010-02-03 11:56 by spiff.)
volforto Offline
Junior Member
Posts: 17
Joined: Jan 2010
Reputation: 5
Post: #3
Thanks spiff! I didn't know cache can be used in such a way. I have added the functionality to the scraper and updated the first post. Now there will always be a thumbnail on any anime series.

With eldon's release of his anidb scraper, there should be more options for scraping anime now.
volforto Offline
Junior Member
Posts: 17
Joined: Jan 2010
Reputation: 5
Post: #4
Added the functionality to also search TVDB fanart using the alternative titles listed on ANN. I found I needed this function for some of the series.

I think this is it in terms of adding functionality (as the most important part for me is getting fanart/banner/thumbnail). Maybe others can improve it if they find a bug or want to add something :)
nuclearsunshine Offline
Junior Member
Posts: 2
Joined: Feb 2010
Reputation: 0
Post: #5
I just tried this and it lumped all the episodes for all my anime under the first show the scraper identified.
volforto Offline
Junior Member
Posts: 17
Joined: Jan 2010
Reputation: 5
Post: #6
Really? That's odd. I am not getting that. Did it work with the normal TVDB scraper before this? The only cause I can think of is that your XBMC uses the cache differently from mine (which deletes the cache when searching for each TV series).

P.S. Updated to v0.31 because ANN changed the html for their casts section. I also took the opportunity to include actors with multiple roles (possible now with the new html).
(This post was last modified: 2010-02-05 09:17 by volforto.)
nuclearsunshine Offline
Junior Member
Posts: 2
Joined: Feb 2010
Reputation: 0
Post: #7
I haven't tried the new version, but everything works fine with TheTVDB and the anidb.net scrapers.
volforto Offline
Junior Member
Posts: 17
Joined: Jan 2010
Reputation: 5
Post: #8
This is definitely a cache thing then. In order to fall back to the ANN thumbnail, I am caching the ANN page. Since I don't know a way to pass the ID to that sub-sub-sub-function, I am using a generic details.html as the cache name. It sounds like your version of XBMC is keeping that cache across series.

I am not sure how to fix that, unless I remove the thumbnail fallback. Perhaps there's a way to pass parameters to the functions?
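
To illustrate why a shared cache name would lump series together, here is a small Python model of the problem. It is only a sketch: the `cache` dictionary, the `details-<id>.html` naming and the series IDs are hypothetical, and it assumes a cache that keeps an existing entry rather than re-fetching, which is a guess at the behavior described above.

```python
cache = {}

def cache_name(series_id=None):
    # Without an ID, every series ends up under the same generic name.
    if series_id is None:
        return "details.html"
    # Hypothetical per-series name; it requires the ID to reach this function.
    return "details-{}.html".format(series_id)

def fetch_details(series_id, page, use_id):
    name = cache_name(series_id if use_id else None)
    # Store the page only if nothing is cached under that name yet,
    # mimicking a cache that survives from one series to the next.
    cache.setdefault(name, page)
    return cache[name]

# Generic name: the second series gets the first series' page back.
cache.clear()
shared_a = fetch_details("1111", "page for series A", use_id=False)
shared_b = fetch_details("2222", "page for series B", use_id=False)

# Per-series names: each series gets its own page.
cache.clear()
own_a = fetch_details("1111", "page for series A", use_id=True)
own_b = fetch_details("2222", "page for series B", use_id=True)
```

With the generic name, `shared_b` comes back as series A's page, which would match the lumping symptom; with per-series names the two pages stay separate.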

EDIT:
I was doing some research and found the clearbuffers="no" option from a previous thread. I can't believe I missed it. Maybe I will be able to do something with it.
(This post was last modified: 2010-02-08 02:31 by volforto.)
TREX6662k5 Offline
Donor
Posts: 214
Joined: Oct 2006
Reputation: 0
Location: London, United Kingdom
Post: #9
Handy thank you!

WYSIWYG
volforto Offline
Junior Member
Posts: 17
Joined: Jan 2010
Reputation: 5
Post: #10
TREX6662k5 Wrote:Handy thank you!

Thanks :)


I have updated the code, which hopefully fixes the problem of the lumped series. It turns out there's another simple method for the thumbnail fallback, so I don't even need the clearbuffers="no" option. I used it anyway to update the functionality of the fanart grabber.

Also fixed some special cases when getting the cast info.
jsc315 Offline
Junior Member
Posts: 10
Joined: Oct 2009
Reputation: 0
Post: #11
thanks :)
volforto Offline
Junior Member
Posts: 17
Joined: Jan 2010
Reputation: 5
Post: #12
Trying to recover some of the first post info from the old forum backup.

Also, the scrapers were submitted as ticket 8961 about a week ago.
http://trac.xbmc.org/ticket/8961
Zarbis Offline
Junior Member
Posts: 17
Joined: Nov 2009
Reputation: 0
Post: #13
I've compared the ANN TV shows scraper and the TheTVDB.com scraper on my anime archive.

How did I measure scraper quality?
Scraper found title - 1 point
Scraper found fan art - 3 points
One of three:
Scraper found banner - 3 points
Scraper found thumb - 2 points
Scraper only got fallback thumb - 1 point

So a scraper can get a maximum of 7 points per title, or 399 points for all 57 titles.
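
The arithmetic behind those maximums, as a small Python sketch. The names are made up for the example; only the point values and the 57 titles come from the list above.

```python
POINTS_TITLE = 1
POINTS_FANART = 3
# Only one of these three can count per title.
POINTS_ART = {"banner": 3, "thumb": 2, "fallback": 1, None: 0}

def score_title(found_title, found_fanart, art):
    """Score one title; art is "banner", "thumb", "fallback" or None."""
    score = POINTS_TITLE if found_title else 0
    score += POINTS_FANART if found_fanart else 0
    score += POINTS_ART[art]
    return score

TITLES = 57
max_per_title = score_title(True, True, "banner")  # 1 + 3 + 3 = 7
max_total = max_per_title * TITLES                 # 7 * 57 = 399
```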

The strangest thing about the ANN scraper's behavior was that it always downloads a thumb, even if I choose banner. Maybe I did something wrong, in which case the test results are not meaningful.

I did a full archive scan for each scraper, removing the whole database before each scan.

Here are the results:

Fully automatic scan:
ANN: 302 points
TheTVDB.com: 323 points

With manual corrections:
ANN: 302 points
TheTVDB.com: 373 points

However, the ANN scraper automatically found 55/57 titles, while TheTVDB.com found only 47/57. The main reason ANN got fewer points is that, as I mentioned before, it downloads only banners. So I hope it's a trivial bug or my fault.

Disclaimer:

First, the test titles are not ideally representative; it's just my 400 GB of anime.
Second, the points system is subjective and based on my tastes. If someone prefers banners over posters or thinks 3 points for fanart is overrated, the results will change significantly.

P.S. However, I think the ANN scraper is worth including in the official XBMC distribution. It provides the functionality needed to deal with specific media formats rather than generic TV shows, and helps solve anime-specific problems that never come up in generic TV show scraping.

P.P.S. Spreadsheet with comparison results: http://dl.dropbox.com/u/459039/comparison.xls
(This post was last modified: 2010-03-18 17:34 by Zarbis.)
volforto Offline
Junior Member
Posts: 17
Joined: Jan 2010
Reputation: 5
Post: #14
Thanks for the comparison, Zarbis :) It certainly looks useful. It's good to see the TV scraper can find most titles automatically (even some movie titles, by the look of it).

I am not too sure I understand the problem you mentioned. Are you saying you have selected the option "Enable TVDB Posters" and the scraper ended up getting banners?
Zarbis Offline
Junior Member
Posts: 17
Joined: Nov 2009
Reputation: 0
Post: #15
volforto Wrote:I am not too sure I understand the problem you mentioned. Are you saying you have selected the option "Enable TVDB Posters" and the scraper ended up getting banners?
Exactly, both "Enable TVDB ***" options are getting banners. I will try to reproduce that behavior, and if I succeed I will try to provide more information.

I confirm that I did something wrong. I rescanned my archive and got posters, not banners. The results are now:

Fully automatic scan:
ANN: 369 points
TheTVDB.com: 323 points

With manual corrections:
ANN: 369 points
TheTVDB.com: 373 points

The ANN scraper is now really close and found all 57 test titles. However, it is strange that the ANN scraper didn't find posters for 10 titles that TheTVDB.com did find.
Anyway, I would personally now use this scraper for anime content, because it gives much more relevant results without needing manual corrections to silly search requests (e.g. To Aru -> Toaru and some more).

P.S. Updated table: http://dl.dropbox.com/u/459039/comparison.xls
(This post was last modified: 2010-03-19 15:11 by Zarbis.)