Scraper Idea...

galvanash Offline
Senior Member
Posts: 102
Joined: Jan 2010
Reputation: 0
Location: Mostly my couch
Post: #1
This might sound like sacrilege to some people, but it is something that has bothered me quite a bit while learning the scraper framework. I searched the forums and could not find anything about this, so if it has been discussed to death before, I apologize in advance.

Anyway, my idea concerns having to escape everything in the scraper's XML. I have a very hard time mentally parsing things, because I not only have to grok the regex syntax but also have to visualize what it will look like after the engine has unescaped it. I understand completely why escaping is required.

There may be tools that others already use for this (I'm just starting to learn, so I might have missed one), but it occurred to me that a "preprocessor" step could be added that would let one work directly in the scraper's XML without escaping things manually. It's easier to show code than to explain (this is a portion of the current imdb scraper):

Code:
<<[escape]>>
<?xml version="1.0" encoding="UTF-8"?>
<scraper framework="1.1" date="2010-10-02">
<NfoUrl dest="3">
  <RegExp input="$$1" output="[escape]<url>http://akas.\1/title/tt\2/</url><id>tt\2</id>[/escape]" dest="3">
   <expression clear="yes" noclean="1">(imdb.com)/Title\?([0-9]*)</expression>
  </RegExp>
  <RegExp input="$$1" output="[escape]<url>http://akas.\1\2/</url><id>tt\2</id>[/escape]" dest="3+">
   <expression noclean="1">(imdb.com/title/tt)([0-9]*)</expression>
  </RegExp>
</NfoUrl>
.....

This is probably mostly self explanatory to anyone doing scraping, but basically:

1. The <<[escape]>> part defines the escape token sequence. Surrounding it with <<>> makes the whole file invalid XML, so an XML parser that tries to read it should fail immediately. This is just for safety. The [escape] part defines a token "tag" (BBCode style); the string inside the [] is arbitrary, and it is up to the scraper dev to make sure it is unique in the current document. I name the files ".ppxml" to make sure they don't get mixed up with valid XML.

2. Anywhere you want to write an un-escaped string, you simply surround the string with [escape][/escape] (or whatever your arbitrary token sequence is).

3. A utility is run against the file to "compile" it down to valid XML. It uses the token sequence defined in the <<>> header to find every string that needs to be escaped, escapes the string, and replaces the whole thing (tokens included) with the result. At the end of the process the final file is saved as ".xml" with the <<>> header removed.
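
The three steps above can be sketched in Python roughly as follows. This is only an illustration of the idea, not the C# utility mentioned in this post: the function name `compile_ppxml`, the header regex, and the decision to escape `&`, `<`, `>` and `"` (the characters that matter inside a double-quoted attribute) are all assumptions.

```python
import re
from xml.sax.saxutils import escape

# Assumed header shape: the very first line is <<[token]>>, e.g. <<[escape]>>.
HEADER = re.compile(r"^<<\[(?P<token>[^\]\s]+)\]>>\s*$")


def compile_ppxml(source: str) -> str:
    """Turn .ppxml text into XML text by escaping every [token]...[/token] span."""
    header, _, body = source.partition("\n")
    match = HEADER.match(header)
    if match is None:
        raise ValueError("first line must look like <<[token]>>")

    token = re.escape(match.group("token"))
    span = re.compile(rf"\[{token}\](.*?)\[/{token}\]", re.DOTALL)

    # A function is used as the replacement so that backslashes such as \1
    # in the scraper output are copied literally instead of being read as
    # regex group references.
    return span.sub(lambda m: escape(m.group(1), {'"': "&quot;"}), body)


SAMPLE = (
    "<<[escape]>>\n"
    '<RegExp input="$$1" '
    'output="[escape]<url>http://akas.\\1/title/tt\\2/</url>[/escape]" '
    'dest="3"/>\n'
)

if __name__ == "__main__":
    print(compile_ppxml(SAMPLE), end="")
```

Reading a ".ppxml" file, passing its text through the function, and writing the result to a ".xml" file would complete step 3. A token that is opened but never closed is left untouched by this sketch; a real tool would probably want to report that as an error.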

I wrote a utility already that does this, and it works great. I wrote it in C# though so I doubt many people would want to use it as is. My questions are:

1. Is there any interest in this, or am I the only one this bothers?

2. Is there a Linux dev who would be willing to create this in C, Python, or whatever is more common on Linux? I'll happily send the code I have to whoever wants it, but this is so easy to write I doubt it would even be needed. It's about a 20-liner in C#, so you could probably do it in 5 lines in Python ;) Hell, a shell script could probably do it fairly easily.

3. Does anyone have a reason NOT to want to see something like this used? It seems fairly bulletproof to me, but I might be missing something.

Thoughts?
(This post was last modified: 2011-03-02 07:44 by galvanash.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #2
http://www.w3schools.com/xml/xml_cdata.asp
galvanash Offline
Senior Member
Posts: 102
Joined: Jan 2010
Reputation: 0
Location: Mostly my couch
Post: #3
spiff Wrote: http://www.w3schools.com/xml/xml_cdata.asp

But you can't use CDATA inside an XML attribute... That would work for content between tags, but not for content inside attributes (which is what output is, and that is where this causes me the most grief).

I did consider using the CDATA syntax instead of a custom token sequence, but I thought people might mistake it for valid syntax when in fact any XML parser would break on it... I figured using a "header" the way I did would make it obvious that what you are looking at is not strict XML.
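
The attribute restriction is easy to check with any XML parser. The snippet below is a small sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the shortened `<RegExp>` strings are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A raw '<' is not allowed inside an attribute value ...
raw_attr = '<RegExp output="<url>\\1</url>"/>'
# ... and wrapping it in CDATA does not help, because CDATA sections are
# only recognized in element content, not in attribute values.
cdata_attr = '<RegExp output="<![CDATA[<url>\\1</url>]]>"/>'
# Escaping by hand is what the attribute form requires.
escaped_attr = '<RegExp output="&lt;url&gt;\\1&lt;/url&gt;"/>'
# CDATA is fine as the content of a child element.
cdata_child = "<RegExp><output><![CDATA[<url>\\1</url>]]></output></RegExp>"

if __name__ == "__main__":
    for name, text in [
        ("raw_attr", raw_attr),
        ("cdata_attr", cdata_attr),
        ("escaped_attr", escaped_attr),
        ("cdata_child", cdata_child),
    ]:
        print(name, is_well_formed(text))
```

Only the last two strings parse, which is why the attribute-based `output` needs either manual escaping or a preprocessing step.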
(This post was last modified: 2011-03-02 21:11 by galvanash.)
galvanash Offline
Senior Member
Posts: 102
Joined: Jan 2010
Reputation: 0
Location: Mostly my couch
Post: #4
spiff Wrote: http://www.w3schools.com/xml/xml_cdata.asp

Another thought... If the scraper parsing were extended to allow an alternate syntax specifying <RegExp> elements using child tags instead of attributes, the whole problem could be neatly dealt with without needing a preprocessor at all:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<scraper framework="1.1" date="2010-10-02">
<NfoUrl dest="3">
  <RegExp>
   <input>$$1</input>
   <output><![CDATA[ <url>http://akas.\1/title/tt\2/</url><id>tt\2</id> ]]></output>
   <expression clear="yes" noclean="1">(imdb.com)/Title\?([0-9]*)</expression>
   <dest>3</dest>
  </RegExp>
  <RegExp>
   <input>$$1</input>
   <output><![CDATA[ <url>http://akas.\1\2/</url><id>tt\2</id> ]]></output>
   <expression noclean="1">(imdb.com/title/tt)([0-9]*)</expression>
   <dest>3+</dest>
  </RegExp>
</NfoUrl>
.....

I looked at ScraperParser.cpp, and it would be a fairly trivial patch... Trivial enough that I could do it. But would anyone besides me be interested in having this as an alternate syntax? :)
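
To make the "either syntax" idea concrete, here is a rough Python sketch of a reader that prefers a child element and falls back to the attribute of the same name. It is not the ScraperParser.cpp code and says nothing about how that file is actually structured. The function name `read_regexp`, the choice to let child elements win over attributes, and the stripping of whitespace around child text (the CDATA blocks in the example above are padded with spaces) are all assumptions.

```python
import xml.etree.ElementTree as ET

# Settings that may appear either as attributes or as child elements.
FIELDS = ("input", "output", "dest")


def read_regexp(node: ET.Element) -> dict:
    """Collect input/output/dest from a <RegExp> node in either syntax."""
    settings = {}
    for name in FIELDS:
        child = node.find(name)
        if child is not None:
            # Child-element form: CDATA arrives as plain text, already unescaped.
            settings[name] = (child.text or "").strip()
        else:
            # Attribute form: the parser has already unescaped &lt; and friends.
            settings[name] = node.get(name)
    return settings


ATTR_FORM = (
    '<RegExp input="$$1" '
    'output="&lt;url&gt;http://akas.\\1/title/tt\\2/&lt;/url&gt;" dest="3">'
    "<expression>(imdb.com)/Title\\?([0-9]*)</expression></RegExp>"
)
CHILD_FORM = (
    "<RegExp><input>$$1</input>"
    "<output><![CDATA[ <url>http://akas.\\1/title/tt\\2/</url> ]]></output>"
    "<expression>(imdb.com)/Title\\?([0-9]*)</expression>"
    "<dest>3</dest></RegExp>"
)

if __name__ == "__main__":
    print(read_regexp(ET.fromstring(ATTR_FORM)))
    print(read_regexp(ET.fromstring(CHILD_FORM)))
```

Both forms produce the same settings, so existing attribute-based scrapers would keep working alongside the new syntax.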
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #5
i see no problems supporting both.