[WIP] AniDB.net Anime Video Scraper

MukiDA Offline
Junior Member
Posts: 16
Joined: Dec 2009
Reputation: 0
Post: #1
I can't, for the life of me, figure out what's wrong with my scraper. My best guess is that I'm not using the "<url>" tag properly, so I'd appreciate a bit of help. I've been working on it with GVim and ScraperXML Editor, and here's what I've found so far:
  • I've tested manually downloaded copies of both the search results page AND an individual result's info page. In both cases, ScraperXML Editor gives me a thumbs up; the results are parsed correctly.
  • AFAIK I can't test in ScraperXML Editor from scratch (that is, starting from a search query). I have version 1.5.
  • The old Scraper.exe program from way back when doesn't work with gzipped HTML. Is there an option to test it with downloaded HTML files?
Any help would be greatly appreciated. My script (modified from the animenfo.com script):

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper name="AniDB.net" date="2009-11-15" content="tvshows" framework="1.0" thumb="anidb.jpg" language="">
  <NfoUrl dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression></expression>
    </RegExp>
  </NfoUrl>
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl?show=animelist&amp;adb.search=\1&amp;do.search=search&lt;/url&gt;" dest="3">
      <expression noclean="1"></expression>
    </RegExp>
  </CreateSearchUrl>
  <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes" noclean="1">&lt;a href="(animedb.pl\?show=anime&amp;amp;aid=([0-9]*))"&gt;([^&lt;]*)&lt;/a&gt;</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetSearchResults>
  <GetDetails dest="3">
    <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
      <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="8">
        <expression repeat="yes">&lt;th class="field"&gt;Main Title&lt;/th&gt;.....&lt;td class="value"&gt;(.[^\n]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="8+">
        <expression noclean="1" trim="1">&lt;th class="field"&gt;Year&lt;/th&gt;.[^&gt;]*&gt;([^&lt;]*)|$</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;thumb&gt;&lt;url spoof=&quot;http://anidb.net&quot;&gt;http://animenfo.com/\1&lt;/url&gt;&lt;/thumb&gt;" dest="8+">
        <expression>&lt;div class="image".[^"]*"(http.[^"]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
        <expression>animevotes&amp;amp;aid=[0-9]*"&gt;(.[^&lt;]*)</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetDetails>
</scraper>
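
One way to check the GetSearchResults expression against a saved page, gzipped or not, without XBMC is a short Python script. This is only a rough sketch: Python's `re` module is not the regex engine XBMC uses, the helper names (`decode_page`, `find_results`) are made up for illustration, and the sample HTML is invented rather than taken from AniDB.

```python
import gzip
import re

# Unescaped form of the <expression> in GetSearchResults above:
# the XML entities &lt; &gt; and &amp;amp; become <, > and &amp;.
RESULT_RE = re.compile(
    r'<a href="(animedb.pl\?show=anime&amp;aid=([0-9]*))">([^<]*)</a>'
)

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def decode_page(raw, encoding="utf-8"):
    """Gunzip the payload if it looks gzipped, then decode it to text."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding, errors="replace")


def find_results(html, base="http://anidb.net/"):
    """Pair title (\\3) with base + \\1, like the <entity> output above."""
    return [(m.group(3), base + m.group(1)) for m in RESULT_RE.finditer(html)]


# Invented sample row; a real test would read the bytes of a saved page.
sample = b'<td><a href="animedb.pl?show=anime&amp;aid=1234">Example Show</a></td>'
results = find_results(decode_page(gzip.compress(sample)))
print(results)
```

Replacing `sample` with the bytes of a downloaded results page (for instance `open(path, "rb").read()`, where `path` is wherever the file was saved) shows whether the expression finds anything and what URL the scraper would hand back.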
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #2
as far as i know the only available attributes for the url tag are

1. spoof="foo" (the referer page)
2. post="yes" (tells the http api to use the POST method)
3. function="CustomFunctionName" (only used when calling a custom function, in GetDetails, GetSettings, or custom functions)

I could be wrong on this, considering i don't fully understand the http handling in XBMC, but i don't think there is a gzip="yes" option for url (i think xbmc automatically detects gzipped sites and decompresses them).

ScraperXML Open Source Web Scraper Library compatible with XBMC XML Scrapers


I Suck, and if you act now by sending only $19.95 and a self addressed stamped envelop, so can you!

(This post was last modified: 2009-12-20 13:57 by Nicezia.)
spiff Offline
Grumpy Bastard Developer
Posts: 12,187
Joined: Nov 2003
Reputation: 82
Post: #3
there is indeed a gzip="yes" parameter to enable gzipped content. this is useful in the cases where the compression is not declared explicitly in the http headers. when the headers do declare it, it's handled automagically by curl.
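
Since the `<url gzip="yes">` element has to be embedded, XML-escaped, inside the `output` attribute of a `<RegExp>`, a small script can produce and check that escaping. This is a sketch using Python's standard `xml.sax.saxutils`; the helper names are invented for illustration and nothing here is part of XBMC itself.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; double quotes are added here
# because the value is going inside a double-quoted XML attribute.
QUOTE = {'"': "&quot;"}
UNQUOTE = {"&quot;": '"'}


def to_output_attr(markup):
    """Escape markup so it can be pasted into output="..."."""
    return escape(markup, QUOTE)


def from_output_attr(value):
    """Reverse the escaping to see what the scraper would actually emit."""
    return unescape(value, UNQUOTE)


# The search URL from the scraper in this thread, written as plain markup.
url_markup = (
    '<url gzip="yes">http://anidb.net/perl-bin/animedb.pl'
    "?show=animelist&adb.search=\\1</url>"
)
escaped = to_output_attr(url_markup)
print(escaped)
```

Running `from_output_attr` on an attribute value copied out of a scraper file is a quick way to spot a stray or missing `&amp;` before loading the scraper in XBMC.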

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
MukiDA Offline
Junior Member
Posts: 16
Joined: Dec 2009
Reputation: 0
Post: #4
Is there any other clear mistake in my code, then? I can't figure out what's wrong, and AFAIK there are no testing programs out there that would help (the only two I know of have the problems listed above, which prevent me from running a complete search test).

Also, is there a way to specify "ignored" portions of my search query? At least for my specific use, I want it to ignore anything in the folder name inside parentheses.
(This post was last modified: 2009-12-20 21:54 by MukiDA.)
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #5
spiff Wrote:there is indeed a gzip="yes" parameter to enable gzipped content. this is useful in the cases where it's not set explicitly in the http headers. in the latter case, it's handled automagically by curl.

good to know, i guess that's something i need to code for

MukiDA Offline
Junior Member
Posts: 16
Joined: Dec 2009
Reputation: 0
Post: #6
I didn't know my version of the editor was so far behind. (caught the 3.5 link in your sig)
MukiDA Offline
Junior Member
Posts: 16
Joined: Dec 2009
Reputation: 0
Post: #7
Okay, now it almost works! This gets the thumbnail, year, title, and rating.

Here are the two questions I have for any other scraper writers:

1. How do I add episode titles?
2. How do I add "excluded" terms for the search query? I place format info (codec, audio tracks) in parentheses in the folder name, so at least for the version of the scraper I keep personally, I'd like to remove all of that from the text that goes to the search query.
2.a. I tried adding the expression:
Code:
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl\?show=animelist&amp;adb.search=\1&lt;/url&gt;" dest="3">
        <expression>(.[^)(]+)</expression>
    </RegExp>
</CreateSearchUrl>
2.b. Thanks to ackana on Freenode's regex channel, I have a couple of nice regexes, NEITHER of which works in XBMC (and BOTH of which work in ScraperXML Editor):
([^)(]+)\([a-zA-Z0-9]+\)+
([^)(]+(?=(?:\([a-zA-Z0-9]+\))+))

Edit note : Okay, scrap.exe works (as far as getting a useful search URL, obviously the g-zip's a no-go) with:
([^)(]+)
What am I doing wrong?! Is there ANY way to figure out WTF XBMC is doing with this expression?! >_<
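
One way to narrow this down outside XBMC is to run all four expressions over the same folder name, once as typed and once URL-encoded. The second case is purely an assumption worth ruling out: nothing in this thread confirms that the text reaching `$$1` is encoded before the scraper sees it. Python's `re` is also not XBMC's regex engine, and the folder name is invented.

```python
import re
from urllib.parse import quote

# The expressions quoted above, keyed by where they appear in the post.
CANDIDATES = {
    "2.a": r"(.[^)(]+)",
    "2.b first": r"([^)(]+)\([a-zA-Z0-9]+\)+",
    "2.b second": r"([^)(]+(?=(?:\([a-zA-Z0-9]+\))+))",
    "plain": r"([^)(]+)",
}


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def compare(text):
    """Apply every candidate expression to the same input."""
    return {name: first_group(p, text) for name, p in CANDIDATES.items()}


folder = "Example Show (x264)"      # invented folder name
raw = compare(folder)               # input exactly as typed
encoded = compare(quote(folder))    # input if it were URL-encoded first

for name in CANDIDATES:
    print(f"{name:12} raw={raw[name]!r} encoded={encoded[name]!r}")
```

On the plain folder name every expression captures the text before the first parenthesis, trailing space included. On the encoded form the parentheses have become `%28` and `%29`, so the two 2.b expressions find nothing while the plain one swallows the whole string. If XBMC does encode its input, that would match the behaviour described here, but it needs confirming against a debug log before anyone relies on it.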

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1" date="2009-11-15" name="AniDB.net" content="tvshows" thumb="anidb.jpg" language="en">
    <NfoUrl dest="3">
        <RegExp input="$$1" output="\1" dest="3">
            <expression></expression>
        </RegExp>
    </NfoUrl>
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/animedb.pl\?show=animelist&amp;adb.search=\1&lt;/url&gt;" dest="3">
            <expression>([^)(]+)</expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <!--     Multiple Results  -->
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;http://anidb.net/perl-bin/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" noclean="1">&lt;a href="(animedb.pl\?show=anime&amp;amp;aid=([0-9]*))"&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>
            <expression noclean="1"></expression>
            
            <!--     Only one Result  -->
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url gzip=&quot;yes&quot;&gt;\2&lt;/url&gt;&lt;/entity&gt;" dest="5+">
                <expression repeat="no" noclean="1">&lt;th class=&quot;field&quot;&gt;Main Title&lt;/th&gt;.....&lt;td class=&quot;value&quot;&gt;(.[^\n]*)....&lt;a class=&quot;shortlink&quot; href=&quot;(http.[^&quot;]*)</expression>
            </RegExp>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="8">
                <expression repeat="yes">&lt;th class="field"&gt;Main Title&lt;/th&gt;.....&lt;td class="value"&gt;(.[^\n]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="8+">
                <expression trim="1" noclean="1">&lt;th class="field"&gt;Year&lt;/th&gt;.[^&gt;]*&gt;([^&lt;]*)|$</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="8+">
                <expression>&lt;div class="image".[^"]*"(http.[^"]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
                <expression>animevotes&amp;amp;aid=[0-9]*"&gt;(.[^&lt;]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
                <expression>class=&quot;desc&quot;&gt;(.*)&lt;/div&gt;</expression>
            </RegExp>

            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
<!-- Created with ScraperXml Editor -->
</scraper>

EDIT: Solved how to deal with a search that has only a single result and goes straight to the info page. Just add a regex to GetSearchResults that looks specifically for the info page. Not sure what to do if the info page doesn't have a link to itself like it does on AniDB, though.
(This post was last modified: 2010-01-01 02:49 by MukiDA.)
spiff Offline
Grumpy Bastard Developer
Posts: 12,187
Joined: Nov 2003
Reputation: 82
Post: #8
$#pages+1 holds the url to the page that is scraped for this very reason.
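
The idea of falling back to the scraped page's own URL can be sketched outside XBMC like this. It is a rough Python illustration only: the patterns are simplified versions of the ones in the scraper above (using `\s*` instead of the fixed-width `.....`), the function name and sample HTML are invented, and Python's `re` is not XBMC's regex engine.

```python
import re

# Simplified versions of the list and info-page expressions above.
LIST_RE = re.compile(
    r'<a href="(animedb.pl\?show=anime&amp;aid=[0-9]*)">([^<]*)</a>'
)
TITLE_RE = re.compile(
    r'<th class="field">Main Title</th>\s*<td class="value">\s*([^<\n]+)'
)


def search_entities(html, page_url, base="http://anidb.net/perl-bin/"):
    """Return (title, url) pairs; use page_url when html is an info page."""
    hits = [(m.group(2).strip(), base + m.group(1))
            for m in LIST_RE.finditer(html)]
    if hits:
        return hits
    match = TITLE_RE.search(html)
    if match:
        # Single hit: the site sent us straight to the info page, so the
        # URL that was fetched is the result URL.
        return [(match.group(1).strip(), page_url)]
    return []


# Invented samples standing in for a results list and an info page.
list_page = ('<a href="animedb.pl?show=anime&amp;aid=1">Show A</a>'
             '<a href="animedb.pl?show=anime&amp;aid=2">Show B</a>')
info_page = ('<tr><th class="field">Main Title</th>\n\t'
             '<td class="value">Example Show\n\t</td></tr>')
fetched = "http://anidb.net/perl-bin/animedb.pl?show=anime&aid=1234"

print(search_entities(list_page, fetched))
print(search_entities(info_page, fetched))
```

This way the info page does not need to contain a link to itself. A real info page may also contain list-style links to related titles, so the order of the two checks might need swapping for a particular site.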

cleaning filenames has nothing to do with the scraper. see <cleanstrings> or something like that in advancedsettings.xml.
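
To try out a cleaning pattern before putting it anywhere near XBMC's settings, something like the following Python sketch may help. The pattern and function name are my own illustration, not something taken from XBMC or its documentation, and Python's `re` may not behave identically to XBMC's regex engine.

```python
import re

# Hypothetical cleaning pattern: optional leading whitespace followed by
# one parenthesised group such as "(x264)" or "(AAC 2ch)".
PAREN_GROUP = re.compile(r"\s*\([^)]*\)")


def clean_search_title(folder_name):
    """Drop every parenthesised group and tidy the surrounding whitespace."""
    return PAREN_GROUP.sub("", folder_name).strip()


for name in ("Example Show (x264) (AAC 2ch)", "Example (2009) Show", "No Parens"):
    print(repr(name), "->", repr(clean_search_title(name)))
```

Unlike the capture-based expressions tried in the scraper, this removes the unwanted parts and leaves names without parentheses untouched.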

no idea why those expressions don't work, and it's way too january 1. in my head atm ;)

eldon Offline
Junior Member
Posts: 14
Joined: Jan 2010
Reputation: 0
Post: #9
hi,
i'm trying to improve that anidb scraper a bit and was wondering whether the regex info in the wiki is still correct.
i'm on linux using xbmc 9.11, and lazy quantifiers in regexes do seem to work, although the wiki states they don't.
Can i assume they work now and are platform independent, or should i stick to painful laziness-free regexes?
And do you have any hints on the regex version currently running in xbmc, and its limitations, to clear things up for my quick shot at this scraper?

thx
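
For comparing the greedy, lazy, and laziness-free forms side by side, a small Python sketch like this shows what each one captures. It only demonstrates how the three forms differ in Python's `re` module and says nothing about which of them XBMC's engine supports on a given platform; the sample HTML is invented and the `desc` pattern mirrors the plot expression from the scraper earlier in the thread.

```python
import re

# Invented sample with a second </div> after the one we care about.
html = '<div class="desc">A short plot.</div><div class="other">x</div>'

PATTERNS = {
    "greedy": r'class="desc">(.*)</div>',       # runs on to the LAST </div>
    "lazy": r'class="desc">(.*?)</div>',        # stops at the first </div>
    "negated": r'class="desc">([^<]*)</div>',   # laziness-free equivalent
}

captures = {name: re.search(p, html).group(1) for name, p in PATTERNS.items()}

for name, text in captures.items():
    print(f"{name:8} {text!r}")
```

The negated character class gives the same result as the lazy quantifier as long as the captured text contains no `<`, which is why it is the usual workaround when lazy matching cannot be relied on.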
zosky Offline
Member+
Posts: 302
Joined: Dec 2008
Reputation: 1
Location: toronto. canada
Post: #10
MukiDA, props for your effort. i've been dreaming of a link between aniDB and xbmc for a while.

out of ~43 shows (on my NAS) theTVdb has found all but 1.
i begrudgingly added it to theTVdb a few days back, but this makes adding new stuff a MASSIVE effort
& now i have another one (missing from theTVdb) :(

once you're ready to beta can i give it a shot ?
(i'm afraid that's as helpful as i can be in this situation :P )


8 xbmc frodo's (rc3) ... atom/ion(xbmcBuntu) | amd3000/nVidia9800(mint12) | iphone4(5.1.1) | hpTouchPad(cyanogen9) ... and more
+ central mySQL db + 11TB mdadm raid6 (+transmission-daemon & flexGet)
+ maraschino + IRtoy.v2