Python scraper

ztripez Offline
Junior Member
Posts: 48
Joined: May 2008
Reputation: 0
Post: #1
I've started a project similar to ScraperXML, but in Python, and the goal is compatibility with Dharma+ addons.
However, all the information about scraper development is kind of (well, that's an understatement) outdated, or perhaps I've missed something?

I'm trying to reverse engineer the scrapers that are included in the Dharma release, but I'm getting very confused. Is there -any- information on how the Dharma engine works with scrapers?

Perhaps a flowchart?
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #2
code. see addons/Scraper.cpp, and video/VideoInfoDownloader.cpp
ztripez Offline
Junior Member
Posts: 48
Joined: May 2008
Reputation: 0
Post: #3
Oh, my C/C++ is very rusty. This will be interesting.

-Z
ztripez Offline
Junior Member
Posts: 48
Joined: May 2008
Reputation: 0
Post: #4
I've put up a repository on GitHub with the project. There's not much yet since I started today, but here it is anyway.

https://github.com/ztripez/pyScraper
ztripez Offline
Junior Member
Posts: 48
Joined: May 2008
Reputation: 0
Post: #5
OK, I've built an addon class that builds a stack with all the functions from its own addon and from its dependencies.
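
A minimal sketch of that kind of collection step, for illustration only. It assumes each scraper file is an XML document whose top-level children are the scraper functions; `collect_functions`, the sample XML strings, and the rule that the addon's own functions override a dependency's on a name clash are all assumptions, not something confirmed in this thread. It uses a dict rather than a literal stack to keep lookups by name simple.

```python
import xml.etree.ElementTree as ET


def collect_functions(addon_xml, dependency_xmls=()):
    """Return {function name: element} for an addon and its dependencies.

    Dependencies are read first so that the addon's own functions win
    on a name clash (an assumption made for this sketch).
    """
    functions = {}
    for source in (*dependency_xmls, addon_xml):
        root = ET.fromstring(source)
        for child in root:
            functions[child.tag] = child
    return functions


# Hypothetical, heavily shortened scraper files.
ADDON = '<scraper><CreateSearchUrl dest="3"/><GetDetails dest="3"/></scraper>'
COMMON = '<scraperfunctions><GetTrailer dest="5"/></scraperfunctions>'

functions = collect_functions(ADDON, [COMMON])
print(sorted(functions))  # ['CreateSearchUrl', 'GetDetails', 'GetTrailer']
```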

I have a couple of questions though:

* There are 20 buffer slots. Does every function have its own local set of buffers, or is there one global set?


* A snippet from tmdb.xml:
Quote:
<CreateSearchUrl dest="3">
  <RegExp input="$$1" output="<url>http://api.themoviedb.org/2.1/Movie.search/$INFO[language]/xml/57983e31fb435df4df77afb854740ea9/\1</url>" dest="3">
    <RegExp input="$$2" output="+\1" dest="4">
      <expression clear="yes">(.+)</expression>
    </RegExp>
    <expression noclean="1"/>
  </RegExp>
</CreateSearchUrl>

The basics are simple:
- Run the regex against buffer 1 (the input), substitute the captured groups into the output template, and put the result in buffer 3.
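
The single-step case described above could be sketched roughly like this. It is only an illustration: `apply_regexp` is a made-up helper, the example URL is a placeholder, and leaving the destination untouched when nothing matches is an assumption rather than confirmed engine behaviour.

```python
import re

NUM_BUFFERS = 20  # the thread mentions 20 buffer slots


def apply_regexp(buffers, input_buf, expression, output, dest):
    """Match `expression` against one buffer and store the filled-in
    `output` template in the `dest` buffer (buffers are numbered from 1)."""
    source = buffers[input_buf - 1]
    match = re.search(expression, source, re.DOTALL)
    if match is None:
        return False  # assumption: no match leaves the destination untouched
    # Replace \1, \2, ... in the template with the captured groups.
    result = re.sub(
        r"\\(\d)",
        lambda m: match.group(int(m.group(1))) or "",
        output,
    )
    buffers[dest - 1] = result
    return True


buffers = [""] * NUM_BUFFERS
buffers[0] = "The Matrix"
apply_regexp(buffers, 1, r"(.+)", r"<url>http://example.invalid/search/\1</url>", 3)
print(buffers[2])  # <url>http://example.invalid/search/The Matrix</url>
```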

However, since there is a nested RegExp, should I run its regex on the parent's buffer? And if so, should I do it before or after I've applied the parent's regex?
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #6
the buffers are global to the parser. if you dig a bit you'll see the 'clearbuffers=no' tag. that's a way to pass information between functions.

expressions are evaluated in a LIFO/depth-first fashion, i.e. dig into the deepest one and evaluate that first.
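
A rough Python sketch of that evaluation order, with one buffer list shared by the whole parser. The helper names (`fill`, `evaluate`) are made up for illustration. The XML is modelled on the tmdb.xml snippet quoted earlier, but with two changes: the output is shortened to a placeholder URL without the raw `<url>` tags (a strict XML parser such as ElementTree rejects a raw `<` inside an attribute value), and a `$$4` reference is added to the outer output so the effect of the ordering is visible. Treating an empty `<expression/>` as `(.*)` and skipping a RegExp that does not match are assumptions here.

```python
import re
import xml.etree.ElementTree as ET

NUM_BUFFERS = 20


def fill(template, buffers, match):
    """Expand $$n (buffer n) and \\n (capture group n) in a single pass."""
    def repl(m):
        if m.group(1) is not None:
            return buffers[int(m.group(1)) - 1]
        return match.group(int(m.group(2))) or ""
    return re.sub(r"\$\$(\d+)|\\(\d)", repl, template)


def evaluate(element, buffers):
    """Evaluate the <RegExp> children of `element`, deepest first."""
    for node in element.findall("RegExp"):
        evaluate(node, buffers)  # nested expressions run before their parent
        source = re.sub(r"\$\$(\d+)",
                        lambda m: buffers[int(m.group(1)) - 1],
                        node.get("input", "$$1"))
        expr = node.find("expression")
        # Assumption: an empty <expression/> matches the whole input.
        pattern = expr.text if expr is not None and expr.text else "(.*)"
        match = re.search(pattern, source, re.DOTALL)
        if match is None:
            continue  # assumption: no match leaves the destination alone
        buffers[int(node.get("dest", "1")) - 1] = fill(
            node.get("output", ""), buffers, match)


SNIPPET = r"""
<CreateSearchUrl dest="3">
  <RegExp input="$$1" output="http://example.invalid/search/\1$$4" dest="3">
    <RegExp input="$$2" output="+\1" dest="4">
      <expression clear="yes">(.+)</expression>
    </RegExp>
    <expression noclean="1"/>
  </RegExp>
</CreateSearchUrl>
"""

buffers = [""] * NUM_BUFFERS
buffers[0], buffers[1] = "Alien", "1979"
root = ET.fromstring(SNIPPET.strip())
evaluate(root, buffers)
result = buffers[int(root.get("dest")) - 1]
print(result)  # http://example.invalid/search/Alien+1979
```

Because the inner RegExp fills buffer 4 before the outer one runs, the outer output can already use `$$4`.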
ztripez Offline
Junior Member
Posts: 48
Joined: May 2008
Reputation: 0
Post: #7
spiff Wrote: the buffers are global to the parser. if you dig a bit you'll see the 'clearbuffers=no' tag. that's a way to pass information between functions.
But if the buffers are global to the scraper, why is 'clearbuffers=no' needed? When do the buffers get cleared?

spiff Wrote: expressions are evaluated in a LIFO/depth-first fashion, i.e. dig into the deepest one and evaluate that first.
Alright, I thought so, thanks.


Thanks for the info
-Z
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #8
by default, if that tag isn't set, the buffers are cleared at the end of a function call (or, well, somewhere before the next function is called, but logic-wise it's easiest to do it at the end of an evaluation).
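
In Python terms, that default could look something like the sketch below. The `ScraperParser` class, its `call` method, and the `fake_evaluate` stub are invented for illustration, and it assumes that 'clearbuffers=no' shows up as a `clearbuffers="no"` attribute on the function element; the function names in the demo are only examples.

```python
import xml.etree.ElementTree as ET

NUM_BUFFERS = 20


class ScraperParser:
    """Holds one set of buffers shared by every function of a scraper."""

    def __init__(self):
        self.buffers = [""] * NUM_BUFFERS

    def call(self, function_xml, evaluate):
        """Run one scraper function and return its dest buffer.

        `evaluate` is a hypothetical callable that fills the buffers
        from the function's <RegExp> elements.
        """
        function = ET.fromstring(function_xml)
        evaluate(function, self.buffers)
        result = self.buffers[int(function.get("dest", "1")) - 1]
        # Default: wipe every buffer once the function has been evaluated.
        if function.get("clearbuffers", "yes") != "no":
            self.buffers[:] = [""] * NUM_BUFFERS
        return result


def fake_evaluate(function, buffers):
    # Stand-in for real RegExp evaluation: writes a result and some scratch data.
    buffers[2] = "result of " + function.tag
    buffers[4] = "scratch data"


parser = ScraperParser()
parser.call('<GetSearchResults dest="3"/>', fake_evaluate)
after_default = parser.buffers[4]
parser.call('<GetDetails dest="3" clearbuffers="no"/>', fake_evaluate)
after_keep = parser.buffers[4]
print(repr(after_default), repr(after_keep))  # '' 'scratch data'
```

With the attribute set to "no", whatever the function left in the buffers is still there when the next function runs, which is how information gets passed along.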
ztripez Offline
Junior Member
Posts: 48
Joined: May 2008
Reputation: 0
Post: #9
Alright, thanks
fastestcomputer Offline
Junior Member
Posts: 1
Joined: Apr 2011
Reputation: 0
Post: #10
Thanks for the info