Adult DVD Empire scraper

Caliche (Junior Member, joined Jul 2007)
Hello, I've just created a scraper for adultdvdempire.com because I thought that jadedvideo is sometimes not the best source and we could use another one that offers more content.
This is the first time I have worked with regular expressions and it hasn't been very easy so far, but I got most of the available information into the scraper.
I based the scraper on the one for jadedvideo, so I left in some parts whose purpose I either don't know or didn't understand.
The only real problem I had was with the plot and the tagline. If somebody with more experience using regular expressions could take a look and see where I made mistakes in the plot and tagline expressions, it would be great.
Thanks.

Here is the code:
(I would attach the XML file but I don't see a way to do it on this board)

<!-- Based on the scraper for Jadedvideo -->

<scraper name="DVD Empire" content="movies" thumb="dvdempire.jpg">

<NfoUrl dest="3">
<!--Don't know what this does but it's on the jadedvideo script so I included it!-->

<RegExp input="$$1" output="http://adult.dvdempire.com/Exec/v1_item.asp?userid=99365730795583&item_id=\1" dest="3">

<expression noclean="1">http://adult\.dvdempire\.com/Exec/v1_item\.asp\?userid=99365730795583&item_id=([0-9]*)</expression>

</RegExp>
</NfoUrl>

<CreateSearchUrl dest="3">

<RegExp input="$$1" output="http://adult.dvdempire.com/Exec/v1_search_all.asp?userid=99365743076143&string=\1&view=0&display_pic=0" dest="3">
<expression noclean="1"></expression>
</RegExp>
</CreateSearchUrl>


<GetSearchResults dest="8">
<RegExp input="$$5" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>" dest="8">
<RegExp input="$$1" output="<entity><title>\2</title><url>http://adult.dvdempire.com/Exec/v1_item.asp?userid=99365730795583&amp;item_id=\1</url></entity>" dest="5">

<!-- This part seems to be working without a problem; the only odd thing is that it does not return as many results as the web site does.
The results I'm looking for contain the string searchStrID.
Example: <a href="/Exec/v1_item.asp?userid=99365730795583&item_id=1235858&searchStrID=2629">Lost</a>
!-->

<expression repeat="yes"><a href="/Exec/v1_item\.asp\?userid=[0-9]*&amp;item_id=([0-9]*)&amp;searchStrID=[0-9]*">(.[^<]*)</a></expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
</GetSearchResults>

<GetDetails dest="3">
<RegExp input="$$5" output="<details>\1</details>" dest="3">



<!--Title-->
<RegExp input="$$1" output="<title>\1</title>" dest="5">
<expression noclean="1" trim="1"><title>Adult DVD Empire - ([^/%]*) -</expression>
</RegExp>


<!--Actors-->
<RegExp input="$$1" output="<actor><name>\1</name></actor>" dest="5+">
<expression repeat="yes">/Exec/v1_list_performer\.asp\?userid=[0-9]*&cast_id=[0-9]*&sort=[0-9]&apos;>(.[^<]*)</expression>
</RegExp>


<!--Director-->
<RegExp input="$$1" output="<director>\1</director>" dest="5+">

<expression noclean="1" repeat="yes">/Exec/v1_list_director\.asp\?userid=[^>]*>([^<]*)</expression>
</RegExp>


<!--Rating-->
<RegExp input="$$1" output="<rating>\1</rating>" dest="5+">
<expression>Overall Rating.[^>]*>.[^>]*>([0-9.]+) out[^<]*<</expression>
</RegExp>

<!--Genre-->
<RegExp input="$$1" output="<genre>Adult / \1</genre>" dest="5+">
<expression>Category</b>\: <nobr>[^>]*>([^<]*)</expression>
</RegExp>


<!--Tagline-->
<RegExp input="$$1" output="<tagline>\1</tagline>" dest="5+">
<expression><td valign="top" class="fontsmall3">[^>]*>([^<]*)</i></expression>
</RegExp>


<!--Plot-->

<!--I had a really hard time trying to clean the plot. I was in fact not able to do it. They have randomly
added the letter "i" where there should be a space, but you can't see the letter because it is the same
color as the background. After cleaning that you should be able to clean up the rest of the tags. Another
problem is that sometimes there are no tags at all, just text, so the "clean up search" has to be conditional.
This is an example:

<i>Experience A Place Where Your Wildest Fantasies...Are Only The Beginning</i><br><br>
From the<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i</font>award-winning
director of <I>Pirates</i>, comes <I>Island Fever 4</i>. Shot entirely
<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i</font>in HD on
<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i</font>location
<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i</font>in the
<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i</font>Bahamas and
<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i</font>Bora Bora, this special
triple-disc set includes 16 sex scenes,<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i
</font>a dozen steamy solo sequences, and<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i
</font>one of the<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i</font>most intense all-girl
orgies ever captured. Packed with over 3 hours of extreme erotic action, this unforgettable production also
includes an additional 2 hours of bonus material and<font face="verdana, arial, sans-serif"
size="-1" color="#ffffff">i</font>special features.<br><br>
See the<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i</font>trailer
<font face="verdana, arial, sans-serif" size="-1" color="#ffffff">i</font>for
<a href="http://www.digitalplayground.com/trailer.php?movid=111"target="new">Island Fever 4</a>!<br><br>-->


<RegExp input="$$8" output="<plot>\1</plot>" dest="5+">
<RegExp input="$$6" output="\1" dest="8">
<RegExp input="$$1" output="\1" dest="6">
<expression noclean="1"><td valign="top" class="fontsmall3">.([^\%]*)</td></expression>
</RegExp>
<expression repeat="yes" noclean="1">([^<]*)<[^>]*></expression>
</RegExp>
<expression></expression>
</RegExp>

<!--Year-->
<RegExp input="$$1" output="<year>\1</year>" dest="5+">
<expression>Production Year:[^/]*/font>([0-9]+)</expression>
</RegExp>


<!--MPAA-->
<RegExp input="$$1" output="<mpaa>\1</mpaa>" dest="5+">
<expression>Rating:[^/]*/font>([^<]*)</expression>
</RegExp>


<!--Runtime-->
<RegExp input="$$1" output="<runtime>\1 \2 \3 \4</runtime>" dest="5+">
<expression>Length[^/]*/font>([0-9]+)[^\;]*\;([A-Za-z]*)[^\;]*\;([0-9]+)[^\;]*\;([A-Za-z]*)</expression>
</RegExp>


<!--Thumb Front-->
<RegExp input="$$1" output="<thumb>http://images.dvdempire.com/res/movies/\1/\2h.jpg</thumb>" dest="5+">
<expression>/images.dvdempire.com/res/movies/([0-9]+)/([0-9]+).jpg</expression>
</RegExp>


<!--Thumb Back-->
<RegExp input="$$1" output="<thumb>http://images.dvdempire.com/res/movies/\1/\2bh.jpg</thumb>" dest="5+">
<expression>/images.dvdempire.com/res/movies/([0-9]+)/([0-9]+).jpg</expression>
</RegExp>


<expression noclean="1"></expression>
</RegExp>
</GetDetails>
</scraper>
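
For anyone who wants to try the plot and tagline expressions outside XBMC, here is a rough Python sketch of the cleanup described in the plot comment above. It is not scraper XML and is only meant for testing the patterns before porting them into the RegExp elements. It assumes the hidden spacer is always a white (#ffffff) font tag containing only the letter "i", and that the tagline, when present, is the leading <i>...</i> followed by line breaks, as in the example; I have not checked either assumption against other pages. The function and variable names are made up for illustration.

```python
import re

# Hidden spacer: a white <font> element whose only content is the letter "i",
# possibly followed by whitespace or a line break before </font>.
SPACER = re.compile(r'<font\b[^>]*color="#ffffff"[^>]*>\s*i\s*</font>', re.IGNORECASE)
# Line breaks should become spaces; every other tag is simply dropped.
LINE_BREAK = re.compile(r'<br\s*/?>', re.IGNORECASE)
ANY_TAG = re.compile(r'<[^>]+>')
# Assumed tagline shape: a leading <i>...</i> followed by one or more <br>.
TAGLINE = re.compile(r'^\s*<i>([^<]*)</i>\s*(?:<br\s*/?>\s*)+', re.IGNORECASE)


def split_tagline_and_plot(cell_html):
    """Return (tagline, plot) from the raw HTML of the description cell."""
    tagline = ""
    match = TAGLINE.match(cell_html)
    if match:
        tagline = match.group(1).strip()
        cell_html = cell_html[match.end():]   # the plot is whatever follows
    text = SPACER.sub(" ", cell_html)         # 1. hidden "i" spacers become spaces
    text = LINE_BREAK.sub(" ", text)          # 2. <br> becomes a space
    text = ANY_TAG.sub("", text)              # 3. drop remaining tags (no-op if none)
    text = re.sub(r"\s+", " ", text)          # 4. collapse runs of whitespace
    return tagline, text.strip()
```

Because every step is a plain substitution, a description with no tags at all passes through unchanged, so the cleanup does not need to be conditional. The important part is the order: the spacer has to be replaced before the generic tag removal, otherwise the stray "i" stays in the text.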