2011-03-02, 07:40
This might sound like sacrilege to some people, but it is something that has bothered me quite a bit while learning the scraper framework. I did search through the forums and could not find anything about this, so if this has been discussed to death before I apologize in advance.
Anyway, my idea concerns having to escape everything in the scraper's XML. I find it very hard to mentally parse these files because I not only have to grok the regex syntax, but also have to visualize what each string will look like after the engine unescapes it. I understand completely why escaping is required.
There may be tools others already use for something like this (I'm just starting to learn, so I might have missed them), but it occurred to me that a "preprocessor" step could be added that would let you work directly in the scraper's XML without having to escape things manually. It's easier to show code than to explain (this is a portion of the current imdb scraper):
Code:
<<[escape]>>
<?xml version="1.0" encoding="UTF-8"?>
<scraper framework="1.1" date="2010-10-02">
<NfoUrl dest="3">
<RegExp input="$$1" output="[escape]<url>http://akas.\1/title/tt\2/</url><id>tt\2</id>[/escape]" dest="3">
<expression clear="yes" noclean="1">(imdb.com)/Title\?([0-9]*)</expression>
</RegExp>
<RegExp input="$$1" output="[escape]<url>http://akas.\1\2/</url><id>tt\2</id>[/escape]" dest="3+">
<expression noclean="1">(imdb.com/title/tt)([0-9]*)</expression>
</RegExp>
</NfoUrl>
.....
This is probably self-explanatory to anyone who writes scrapers, but briefly:
1. The <<[escape]>> part defines the escape token sequence. Surrounding it with <<>> makes the whole file invalid XML, so any XML parser that tries to read it will fail immediately; this is purely a safety measure. The [escape] part defines a BBCode-style token "tag". The string inside the brackets is arbitrary; it is up to the scraper dev to make sure it is unique within the document. I name the files ".ppxml" so they don't get mixed up with valid XML.
2. Anywhere you want to write an unescaped string, you simply surround it with [escape][/escape] (or whatever your chosen token sequence is).
3. A utility is run against the file to "compile" it down to valid XML. It uses the token defined in the <<>> header to find every string that needs escaping, escapes it, and replaces the whole marked region with the result. At the end of the process the final file is saved as ".xml" with the <<>> header removed.
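To make the three steps above concrete, here is a minimal sketch of such a preprocessor in Python. This is not the C# utility described in the post, just an illustration of the same idea; it assumes the `<<[token]>>` header is the first line of the file and that `html.escape` (which handles `&`, `<`, `>`, and quotes) covers the XML entities the scraper engine expects.

```python
import re
import html

def preprocess(text):
    # The first line must be the <<[token]>> header, e.g. <<[escape]>>.
    header, _, body = text.partition("\n")
    m = re.fullmatch(r"<<\[(\w+)\]>>", header.strip())
    if not m:
        raise ValueError("missing <<[token]>> header")
    token = m.group(1)
    # Find every [token]...[/token] region, XML-escape its contents,
    # and drop the marker tags; the header line is dropped as well.
    pattern = re.compile(
        re.escape(f"[{token}]") + r"(.*?)" + re.escape(f"[/{token}]"),
        re.DOTALL,
    )
    return pattern.sub(lambda mo: html.escape(mo.group(1)), body)
```

A driver that reads `scraper.ppxml` and writes `scraper.xml` is then a two-liner around this function. Note that `html.escape` replaces `&` first, so already-marked regions are escaped exactly once.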
I have already written a utility that does this, and it works great. I wrote it in C#, though, so I doubt many people would want to use it as-is. My questions are:
1. Is there any interest in this, or am I the only one this bothers?
2. Is there a Linux dev willing to implement this in C, Python, or whatever is more common on Linux? I'll happily send the code I have to whoever wants it, but this is so easy to write that it probably isn't even needed: it's about 20 lines in C#, so you could probably do it in 5 lines of Python. Hell, a shell script could probably manage it.
3. Does anyone have a reason NOT to want something like this used? It seems fairly bulletproof to me, but I might be missing something.
Thoughts?