2008-12-04, 00:59
After helping artik with his Excalibur Films scraper, I learned a lot more about regexp coding and scrapers in general, so I was able to finish my AdultDVDEmpire scraper. This scraper retrieves the following info:
- Film Title along with box cover.
- Production year and film studio.
- Rating (which should always be XXX, but I set it to pull anyway).
- Film Director
- Film Genres/Categories (All categories are pulled if a film fits into multiple ones).
- Film actors/actresses, along with thumbnails for each star if available.
- Film runtime.
- Film plot/tagline. *
* About the plot and tagline: some films have just a plot, some have just a tagline, and some have both. Since the tagline comes first in the page's code, the scraper first tries to pull just a tagline, then just a plot, and finally a plot that has a tagline before it. This should work most of the time, but there are a few films it fails on. It will also fail to pull the complete plot if the plot itself contains a '<' bracket. For instance: "This here is a plot about an <b>AWESOME</b> movie!" There are a few plots like that, and in that case it will pull everything up to 'an'. I tried to figure out a way around this but couldn't, and finally settled on having it pull as much as it can in a case like that. If it fails otherwise, it's most likely because that particular movie's page has weird coding (which I've also run into during testing).
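To make the tagline/plot fallback and the '<' truncation concrete, here is a minimal sketch in Python's `re` module rather than in the scraper format. The sample markup, the closing `</div>`, and the `extract` helper are assumptions made for illustration; the real page layout may differ, and I have not checked whether the scraper's regex engine accepts the lazy `.*?` used here.

```python
import re

# Hypothetical markup: only the class names come from the scraper's patterns.
BOTH = ('<div class="Item_InfoContainer"><span class="InfoTagLine">Best ever.</span>'
        ' This here is a plot about an <b>AWESOME</b> movie!</div>')
PLOT_ONLY = '<div class="Item_InfoContainer">Just a plot.</div>'
TAGLINE_ONLY = ('<div class="Item_InfoContainer">'
                '<span class="InfoTagLine">Only a tagline.</span></div>')

TAG = re.compile(r'<[^>]+>')

def extract(html):
    """Return (tagline, plot); either value may be None."""
    m = re.search(r'InfoTagLine">([^<]*)', html)
    tagline = m.group(1).strip() if m else None
    # Try "plot after a tagline span" first, then "plot only".
    m = (re.search(r'Item_InfoContainer">\s*<span[^>]*>[^<]*</span>(.*?)</div>', html, re.S)
         or re.search(r'Item_InfoContainer">(.*?)</div>', html, re.S))
    # Strip inline tags such as <b> instead of stopping at the first '<'.
    plot = TAG.sub('', m.group(1)).strip() if m else None
    return tagline, plot or None

# The scraper's own pattern stops at the first '<', which causes the truncation.
truncated = re.search(
    r'Item_InfoContainer">[^>]*>[^<]*</span>[^ ]*([^<]*)<', BOTH).group(1).strip()

print(truncated)
print(extract(BOTH))
```

The difference is that `([^<]*)` can never cross a '<', while capturing up to an assumed closing tag and stripping inline tags afterwards keeps the whole sentence.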
This script pulled a good majority of my collection on the first try. I do have a few low-budget films that it couldn't find, but those were hard to find even with Google, so I'm happy either way. Here are some shots taken off my Xbox with the MediaStream skin and the script:
Code:
<scraper name="Adult DVD Empire" content="movies" thumb="adultdvdempire.jpg">
  <!-- Markup inside attributes and expressions is entity-escaped so the file parses as XML. -->
  <NfoUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://www.adultdvdempire.com/itempage.aspx?item_id=\1&lt;/url&gt;" dest="3">
      <!-- '.' and '?' are escaped so they match literally -->
      <expression noclean="1">adultdvdempire\.com/itempage\.aspx\?item_id=([0-9]*)</expression>
    </RegExp>
  </NfoUrl>
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://www.adultdvdempire.com/SearchTitlesPage.aspx?SearchString=\1&lt;/url&gt;" dest="3">
      <expression noclean="1"></expression>
    </RegExp>
  </CreateSearchUrl>
  <GetSearchResults dest="6">
    <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="6">
      <RegExp input="$$1" output="\1" dest="4">
        <expression>&lt;a href="itempage\.aspx\?item_id=([0-9]*)[^&gt;]&gt;</expression>
      </RegExp>
      <!-- One entity per search hit: \1 = item id, \2 = title -->
      <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;url&gt;http://www.adultdvdempire.com/itempage.aspx?item_id=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">ListItem_ItemTitle"&gt;&lt;a href=[^=]*=([0-9]*)[^&gt;]*&gt;([^&lt;]*)</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetSearchResults>
  <GetDetails dest="3">
    <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
      <RegExp input="$$1" output="&lt;thumb&gt;http://images2.dvdempire.com/res/movies/\1h.jpg&lt;/thumb&gt;" dest="5">
        <expression>BoxCover_Container"&gt;[^&gt;]*&gt;&lt;img src="http://images2.dvdempire.com/res/movies/([^m]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="5+">
        <expression>Item_Title"&gt;([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;studio&gt;\1&lt;/studio&gt;" dest="5+">
        <expression>StudioProductionRating"&gt;([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="5+">
        <expression>Year: ([0-9]*)</expression>
      </RegExp>
      <!-- Tagline first, then plot only, then plot preceded by a tagline -->
      <RegExp input="$$1" output="&lt;tagline&gt;\1&lt;/tagline&gt;" dest="5+">
        <expression>InfoTagLine"&gt;([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="7">
        <expression clear="yes">Item_InfoContainer"&gt;[^ ]*([^&lt;]*)&lt;</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5+">
        <expression>Item_InfoContainer"&gt;[^&gt;]*&gt;[^&lt;]*&lt;/span&gt;[^ ]*([^&lt;]*)&lt;</expression>
      </RegExp>
      <!-- \1 = cast id (used for the thumbnail URL), \2 = name -->
      <RegExp input="$$1" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;thumb&gt;http://images.dvdempire.com/pornstar/actors/\1.jpg&lt;/thumb&gt;&lt;/actor&gt;" dest="5+">
        <expression repeat="yes">cast_id=([0-9]*)[^t]*type=1"[^&gt;]*&gt;([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="5+">
        <expression repeat="yes">media_id=[^i]*item_id=[^&gt;]*&gt;([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="5+">
        <expression>&gt;Length: ([^&lt;]*)&lt;</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;mpaa&gt;\1&lt;/mpaa&gt;" dest="5+">
        <expression>&gt;Rating: ([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
        <expression repeat="yes">type=4"&gt;([^&lt;]*)</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetDetails>
</scraper>
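For anyone who wants to try the patterns outside XBMC, here is a rough Python sketch of what the repeated search-result expression does, plus a check of how a literal `?` in a URL has to be written in a pattern. The HTML fragment, the `search_results` helper and the sample item ids are made up for illustration; only the class name, the URL shape and the result pattern come from the scraper, and Python's `re` is standing in for the scraper's regex engine.

```python
import re

# Hypothetical search-page fragment with two hits.
SEARCH_HTML = (
    '<span class="ListItem_ItemTitle">'
    '<a href="itempage.aspx?item_id=111" title="x">First Film</a></span>'
    '<span class="ListItem_ItemTitle">'
    '<a href="itempage.aspx?item_id=222">Second Film</a></span>'
)

# Same pattern as the repeat="yes" expression in GetSearchResults.
RESULT_RE = re.compile(r'ListItem_ItemTitle"><a href=[^=]*=([0-9]*)[^>]*>([^<]*)')

def search_results(page):
    """Build one <entity> per match, roughly like repeat="yes" does."""
    entities = [
        f'<entity><title>{title}</title>'
        f'<url>http://www.adultdvdempire.com/itempage.aspx?item_id={item_id}</url></entity>'
        for item_id, title in RESULT_RE.findall(page)
    ]
    return '<results>' + ''.join(entities) + '</results>'

# An unescaped '?' is a quantifier (it makes the 'x' optional), so it never
# matches the literal '?' in the URL; the escaped form does.
NFO_URL = 'http://www.adultdvdempire.com/itempage.aspx?item_id=12345'
unescaped = re.search(r'itempage.aspx?item_id=([0-9]*)', NFO_URL)
escaped = re.search(r'itempage\.aspx\?item_id=([0-9]*)', NFO_URL)

print(search_results(SEARCH_HTML))
print(unescaped, escaped.group(1))
```

The result pattern avoids the '?' problem by skipping ahead with `[^=]*=` instead of spelling out the URL, which is why it finds the ids either way.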
To spiff, if you read this: I submitted a ticket for this already.