Kodi Community Forum
ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... - Printable Version

+--- Thread: ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... (/showthread.php?tid=50055)


- Nicezia - 2009-06-07

OK, all issues are fixed. I'm just converting to C# and making GUI and console test programs for Linux. My girlfriend surprised me with a week's vacation (which is where I am right now, spending most of my time with her), so I won't be updating for about a week, but I'm still programming in my downtime.

As soon as I get back I should have not only the test programs done, but also a full-fledged media manager for Linux (via Mono and GTK#) and Windows.


- ultrabrutal - 2009-06-21

tmdb scraper crashes for me:

This is the url it generates for Terminator:

<url>http://api.themoviedb.org/2.0/Movie.search?title=terminator&amp;api_key=57983e31fb435df4df77afb854740ea9</url>

It works in a web browser, but fails in the program with exception:


See the end of this message for details on invoking
just-in-time (JIT) debugging instead of this dialog box.

************** Exception Text **************
System.Xml.XmlException: The 'results' start tag on line 1 does not match the end tag of 'result'. Line 1, position 1055.
at System.Xml.XmlTextReaderImpl.Throw(Exception e)
at System.Xml.XmlTextReaderImpl.ThrowTagMismatch(NodeData startTag)
at System.Xml.XmlTextReaderImpl.ParseEndElement()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
at System.Xml.Linq.XDocument.Load(XmlReader reader, LoadOptions options)
at System.Xml.Linq.XDocument.Parse(String text, LoadOptions options)
at ScraperXML.ScraperParser.GetSearchResults(String strUrl)
at ScraperXML_Test_Program.Form1.Button2_Click(Object sender, EventArgs e)
at System.Windows.Forms.Control.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)


************** Loaded Assemblies **************
mscorlib
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.3074 (QFE.050727-3000)
CodeBase: file:///C:/Windows/Microsoft.NET/Framework64/v2.0.50727/mscorlib.dll
----------------------------------------
ScraperXML Test Program
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Users/bleze/Desktop/ScraperXML-1.0/ScraperXML%20Test%20Program/bin/Release/ScraperXML%20Test%20Program.exe
----------------------------------------
Microsoft.VisualBasic
Assembly Version: 8.0.0.0
Win32 Version: 8.0.50727.3053 (netfxsp.050727-3000)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/Microsoft.VisualBasic/8.0.0.0__b03f5f7f11d50a3a/Microsoft.VisualBasic.dll
----------------------------------------
System
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System/2.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Windows.Forms
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System.Windows.Forms/2.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System.Drawing
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System.Drawing/2.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------
System.Runtime.Remoting
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System.Runtime.Remoting/2.0.0.0__b77a5c561934e089/System.Runtime.Remoting.dll
----------------------------------------
ScraperXML
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Users/bleze/Desktop/ScraperXML-1.0/ScraperXML%20Test%20Program/bin/Release/ScraperXML.DLL
----------------------------------------
Accessibility
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/Accessibility/2.0.0.0__b03f5f7f11d50a3a/Accessibility.dll
----------------------------------------
System.Xml.Linq
Assembly Version: 3.5.0.0
Win32 Version: 3.5.30729.1 built by: SP
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System.Xml.Linq/3.5.0.0__b77a5c561934e089/System.Xml.Linq.dll
----------------------------------------
System.Core
Assembly Version: 3.5.0.0
Win32 Version: 3.5.30729.1 built by: SP
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System.Core/3.5.0.0__b77a5c561934e089/System.Core.dll
----------------------------------------
System.Xml
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.3074 (QFE.050727-3000)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System.Xml/2.0.0.0__b77a5c561934e089/System.Xml.dll
----------------------------------------
System.Configuration
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
CodeBase: file:///C:/Windows/assembly/GAC_MSIL/System.Configuration/2.0.0.0__b03f5f7f11d50a3a/System.Configuration.dll
----------------------------------------

************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the .config file for this
application or computer (machine.config) must have the
jitDebugging value set in the system.windows.forms section.
The application must also be compiled with debugging
enabled.

For example:

<configuration>
<system.windows.forms jitDebugging="true" />
</configuration>

When JIT debugging is enabled, any unhandled exception
will be sent to the JIT debugger registered on the computer
rather than be handled by this dialog box.


- ultrabrutal - 2009-06-21

Never mind. I found the problem: a wrong end tag in tmdb.xml:

<results>...</result>

It is missing an 's' there; once that is added, it works. Perhaps a dev will kindly fix this in SVN.
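
A mismatch like this can be caught before the scraper is ever run by checking that the output template parses on its own. The following is a minimal Python sketch, not part of ScraperXML; `check_well_formed` is a hypothetical helper and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, "") if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


# Made-up output template with the same mistake: <results> closed by </result>.
broken = "<results><entity><title>Terminator</title></entity></result>"
fixed = "<results><entity><title>Terminator</title></entity></results>"

print(check_well_formed(broken))  # not well formed; the message names the mismatch
print(check_well_formed(fixed))   # well formed
```

Running such a check over each scraper's output templates at load time would report the bad tag with a line and column instead of crashing during a search.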


- ultrabrutal - 2009-06-21

I get this exception when a scraper has a "text" setting and I type into its field:

************** Exception Text **************
System.NullReferenceException: Object reference not set to an instance of an object.
at ScraperXML.ScraperSetting.set_Paramater(String value)
at ScraperXML_Test_Program.Form1.DynamicTextBox_TextChanged(Object sender, EventArgs e)
at System.Windows.Forms.Control.OnTextChanged(EventArgs e)
at System.Windows.Forms.TextBoxBase.WmReflectCommand(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)




- Nicezia - 2009-06-23

ScraperXML development is on hold for a little while.

Most of the problems in ScraperXML right now come from either gathering the data (different content types, and different methods for each type) or XML parsing.

Right now I'm researching charsets so that I can make it support foreign scrapers as well.
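
As a rough illustration of the charset problem, the Python sketch below decodes a downloaded page using the charset named in its Content-Type header and falls back to a default when the label is missing or unknown. `decode_body` and the sample data are hypothetical; nothing here reflects how ScraperXML itself handles charsets.

```python
import re


def decode_body(raw, content_type, fallback="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type header."""
    match = re.search(r'charset="?([\w.:-]+)', content_type or "", re.IGNORECASE)
    charset = match.group(1) if match else fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: decode with the fallback and
        # replace undecodable bytes rather than failing.
        return raw.decode(fallback, errors="replace")


# Made-up page body encoded as ISO-8859-1, as a foreign site might send it.
page = "Amélie".encode("iso-8859-1")
print(decode_body(page, "text/html; charset=ISO-8859-1"))
```

Pages that declare their charset only in an HTML meta tag would need an extra step that this sketch leaves out.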

I'll look into the problem, but because of my budget I don't have any internet to test with, so I'm using local copies of HTML that I download at work and take home. So it's slow going.


Hmm...
I know the problem is with that, though, and I can fix it soon and re-upload the source (though ScraperXML is being completely rewritten in C# now, since C# is my new language of choice).


- tslayer - 2009-06-23

Cut out food instead. You need internet.


- Nicezia - 2009-06-23

Yeah, I was considering that, but since I've lost 70 pounds in the last three months while eating normally, it's not a good idea!


- Nicezia - 2009-06-23

You know, it would be a great help if the scrapers pooled all their info, so that I didn't have to write individual content handlers that check and recheck for duplicate entries after every custom function. That way both ScraperXML and XBMC would only have to update the info once, at the very end of running "GetDetails", instead of after every custom function. This is the one thing I'm having trouble with, as some custom functions don't return fully qualified XML elements and the parsers I use can't handle fragments. (And my exception-handling skills aren't quite up to par yet.)
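
One common workaround for fragments is to wrap them in a throwaway root element before parsing, then merge the children into a single pooled details element while skipping duplicates. The Python sketch below shows the idea; `merge_fragment`, the wrapper element name and the sample data are all assumptions for illustration, not ScraperXML or XBMC code.

```python
import xml.etree.ElementTree as ET


def merge_fragment(details, fragment):
    """Parse an XML fragment and append its new elements to details.

    The fragment may hold several top-level elements (or none), so it is
    wrapped in a throwaway root before parsing.
    """
    wrapper = ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    seen = {(child.tag, (child.text or "").strip()) for child in details}
    for child in wrapper:
        key = (child.tag, (child.text or "").strip())
        if key not in seen:  # skip entries that are already present
            seen.add(key)
            details.append(child)
    return details


details = ET.fromstring("<details><title>Alien</title></details>")
merge_fragment(details, "<genre>Horror</genre><genre>Sci-Fi</genre>")
merge_fragment(details, "<genre>Horror</genre><thumb>a.jpg</thumb>")
print(ET.tostring(details, encoding="unicode"))
```

The duplicate check here only compares tag name and text, which is enough for flat entries like genres but would need refining for nested elements.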


- smeehrrr - 2009-06-27

Any chance of adding the Helper Objects directory to SVN? It looks like it's missing.


- Nicezia - 2009-06-27

The newest version will be out next weekend (July 4th weekend), with better content handling, better network support and written completely in C#

I'm hunting down all the errors I can, and (thanks to creating the ScraperXML editor) I have a better understanding of some of the places I went wrong with ScraperXML. Error handling in ScraperXML is also better, and logging is managed differently. There are two levels of logging: verbose, which creates a log entry for every step of the important scraper methods (from parsing the RegExp/expression to applying the regular expression to the input), and error-only, which creates a log entry only when something goes critically wrong and the scraper is unable to return anything at all.

Also, several namespaces are now included.

There's ScraperXML (which houses the object definitions that form the base for including ScraperXML in an HTPC program), ContentHandlers (which houses objects that can be used to store information retrieved from scrapers or to integrate into media management programs), ScraperLib (which houses the framework used to edit scrapers, the basic framework for ScraperXML Editor), and Utilities (which is just network and charset conversion tools).
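
The two logging levels described above map naturally onto standard log severities. The Python sketch below is only an illustration of that idea using the standard `logging` module; `make_scraper_logger`, `run_step` and the logger name are hypothetical and say nothing about how ScraperXML implements its logging.

```python
import logging


def make_scraper_logger(verbose):
    """Return a logger that records every step (verbose) or only errors."""
    logger = logging.getLogger("scraper_sketch")
    logger.setLevel(logging.DEBUG if verbose else logging.ERROR)
    return logger


def run_step(logger, name, action):
    """Run one scraper step, logging it in verbose mode and on failure."""
    logger.debug("running step %s", name)
    try:
        return action()
    except Exception:
        # Error-only mode still records this entry, with the traceback.
        logger.error("step %s failed", name, exc_info=True)
        return None
```

With `verbose=False`, a successful step produces no log entry at all, while a failing step produces a single error entry.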


- fekker - 2009-06-28

Nice work, looking forward to putting this to use in UMC (Universal Media Companion, aka UMM). I'll be back in two weeks and will be hitting ya up for sure.

- fekker

Sidenote: any chance of using a .net 2.0 target framework? (just an idea)


- Gamester17 - 2009-06-28

Nicezia Wrote:The newest version will be out next weekend (July 4th weekend), with better content handling, better network support and written completely in C#
Any chance of then targeting Mono as the default framework?

http://en.wikipedia.org/wiki/Mono_(software)
http://www.mono-project.com

Mono 2.x aims to be compatible with the Microsoft .NET 2.0 Framework, so if it works with Mono it should also work with the Microsoft .NET 2.0 Framework. That way it will be possible to use the ScraperXML library on Linux, Mac, and Windows as long as Mono 2.0 or later is installed (and on Windows it should then automatically work with the Microsoft .NET 2.0 Framework as well).

You might also want to check out the MonoDevelop IDE:
http://en.wikipedia.org/wiki/MonoDevelop
http://www.monodevelop.com

...though you should still be able to work in Microsoft Visual Studio too with the same code.

PS! Mono 2.4 is said to be compatible with C# 3.0 (including LINQ), but it's best to stick with .NET 2.0 to be on the safe side.


- Nicezia - 2009-06-28

Thus far it's completely compatible with Mono 2.0.1.

I've been targeting all my programming at Mono compatibility these days (it'd be nice if there were a native C# for Linux, but alas, no).

I'd like to do it all in C++, but I don't have a full grasp of C++ (although C# is definitely giving me a deeper understanding, there is still too much in-depth stuff I'd have to look into at the moment to convert what I have to pure C++).

I'm also already using MonoDevelop.

And Mono 2.0.1 is also compatible with C# 3.0 and .NET 3.5, though I'm trying to stick to .NET 2.0.


Does anyone know who the writer of the AllMusic scraper is? - Nicezia - 2009-06-29

I need to figure out what the author of the AllMusic scraper intended with the custom function "GetDiscography"; I think I might be missing something in my library.

Code:
<GetDiscography dest="5">
        <RegExp input="$$2$$3" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
            <RegExp input="$$1" output="&lt;album&gt;&lt;year&gt;\1&lt;/year&gt;&lt;title&gt;\3&lt;/title&gt;&lt;label&gt;\4&lt;/label&gt;&lt;/album&gt;" dest="2">
                <expression repeat="yes" clear="yes" noclean="1,3,4">sorted-cell&quot;&gt;([0-9]+)&lt;/td&gt;&lt;td[^&gt;]*&gt;(&lt;a href=[^&gt;]*&gt;&lt;img [^&gt;]*/&gt;&lt;/a&gt;|[^&lt;]*)?&lt;/td&gt;&lt;td[^&gt;]*&gt;&lt;a href=[^&gt;]*&gt;([^&lt;]*)&lt;/a&gt;&lt;/td&gt;&lt;td[^&lt;]*&lt;/td&gt;&lt;td[^&gt;]*&gt;([^&lt;]+)&lt;/td&gt;</expression>
            </RegExp>
            <RegExp input="$$2" output="\1&amp;amp;\2" dest="3">
                <expression noclean="1,2" repeat="yes">(.*?)&amp;(.+)</expression>
            </RegExp>
            <RegExp input="$$3" output="" dest="2">
                <expression>(.+)</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDiscography>

It seems to me that the final nested statement will delete the info in $$2. But what if there is no &amp; in $$2 to begin with? Then the discography ends up empty and this function returns nothing, even if albums are found. ((.+) indicates there needs to be at least one character, but for some reason it still clears $$2.)

Maybe there's something I'm not understanding about the scraper process.

(I do know, though, that in my code this check for ampersands is unnecessary, as I take all precautions with ampersands to make things parseable.)



EDIT: Never mind, I just realized that I was making "\1" the default for output. That's why it was deleting whether or not there was anything there.
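
One way to read the last two nested steps of GetDiscography is sketched below in Python. It rests on two assumptions that are not confirmed anywhere in this thread: an expression that does not match leaves the dest buffer untouched, and an empty output attribute writes an empty string. The `repeat`, `clear` and `noclean` attributes are ignored, and `apply_regexp` and `discography_tail` are hypothetical names.

```python
import re


def apply_regexp(buffers, source, pattern, output, dest):
    """Simplified stand-in for one <RegExp> step (hypothetical helper).

    Assumption: a pattern that does not match leaves buffers[dest] alone.
    """
    match = re.search(pattern, source, re.DOTALL)
    if match is None:
        return
    # An empty output template is taken literally, so it clears dest.
    buffers[dest] = match.expand(output) if output else ""


def discography_tail(albums_xml):
    """Mimic the ampersand handling at the end of GetDiscography."""
    buffers = {2: albums_xml, 3: ""}
    # Step 1: if $$2 contains "&", write an escaped copy into $$3.
    apply_regexp(buffers, buffers[2], r"(.*?)&(.+)", r"\1&amp;\2", 3)
    # Step 2: if $$3 now holds anything, clear $$2 so only one copy survives.
    apply_regexp(buffers, buffers[3], r"(.+)", "", 2)
    # The outer RegExp reads $$2$$3, i.e. whichever buffer is non-empty.
    return buffers[2] + buffers[3]


print(discography_tail("<title>Rock</title>"))
print(discography_tail("<title>R&B</title>"))
```

Under these assumptions an album list without an ampersand passes through unchanged, because $$3 stays empty and the clearing step never matches.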


A question - Nicezia - 2009-07-01

Why does the nfo not support the same format that is returned by the scrapers?

For instance, why does it not support multiple thumb tags?

Code:
<thumbs>
   <thumb>foo.jpg</thumb>
   <thumb>foo2.jpg</thumb>
</thumbs>

I think it would be a good idea for the nfo to be a direct reflection of what can be returned by the scrapers, as it would simplify passing multiple values from metadata/media managers that might want to share info with XBMC.
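
For a sense of what a consumer of such a fragment would have to do, here is a small Python sketch that reads every thumb entry from the example above. `read_thumbs` is a hypothetical helper, and the sketch makes no claim about what the nfo format actually accepts.

```python
import xml.etree.ElementTree as ET

# The fragment from the post above, as a scraper might return it.
nfo_fragment = """
<thumbs>
   <thumb>foo.jpg</thumb>
   <thumb>foo2.jpg</thumb>
</thumbs>
"""


def read_thumbs(xml_text):
    """Return the text of every <thumb> element, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [thumb.text for thumb in root.iter("thumb") if thumb.text]


print(read_thumbs(nfo_fragment))  # ['foo.jpg', 'foo2.jpg']
```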