Posts: 4,146 · Joined: Jan 2008 · Reputation: 40
Cool. Nope, the scrapers won't change except for the regexp stuff; the layout (backend) will most likely stay the same. Any plans to make it for other platforms as well? Maybe Mono is the way to go.
Posts: 369 · Joined: Nov 2006 · Reputation: 0
Yeah, actually I've just made a copy of the class in C++. Considering my C++ isn't really up to par, the C++ version may be a little longer in the making, but doing it in Visual Basic for now is giving me enough insight to visualize how to port it to C++ (sans .NET). The only real problems I see so far in making it cross-platform are:
A) breaking my reliance on LINQ (but LINQ makes XML so easy...)
B) I have no experience building a GUI without the Visual Studio designer (I am looking into information on X Window System programming, though)
I'll look into cross-platform support, and particularly Mono, as soon as I can get a working version out; first things first.
spiff · Team-Kodi Member · Posts: 12,706 · Joined: Nov 2003 · Reputation: 129
great stuff!
the scraper xml format will not change; however, i will change the default to matching regular expressions case-insensitively and add a tag to indicate case-sensitive matching (see #6262). i also plan some minor changes to the returned xml format, but that doesn't really matter for you.
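for illustration, a minimal sketch of the scraper xml format (buffer numbers and expressions are made up; the attribute for case-sensitive matching is a guess, the final syntax is whatever #6262 lands on):

```xml
<!-- hypothetical sketch: buffer $$1 holds the fetched page; the
     entity-escaped output is written to buffer $$3 -->
<GetDetails dest="3">
  <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="3">
    <!-- matching is case-insensitive by default; a tag/attribute
         (name assumed here) would switch on sensitive matching -->
    <expression>&lt;h1&gt;([^&lt;]*)&lt;/h1&gt;</expression>
  </RegExp>
</GetDetails>
```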
the key to the nesting stuff and other functions is recursive code (see CIMDB::InternalGetDetails).
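the nesting can be sketched like so (buffers and expressions are made up for illustration): the inner RegExp is evaluated first and fills the buffer that the outer one reads.

```xml
<RegExp input="$$2" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="3">
  <!-- inner RegExp runs first: pulls the genre block out of the
       page held in $$1 and stores it in buffer $$2 -->
  <RegExp input="$$1" output="\1" dest="2">
    <expression>genre:(.*?)&lt;br&gt;</expression>
  </RegExp>
  <!-- the outer expression then matches against the now-filled $$2 -->
  <expression>([A-Za-z ]+)</expression>
</RegExp>
```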
Posts: 369 · Joined: Nov 2006 · Reputation: 0
The name of it is ScrapeMe; that was just a typo when I created the project, and I've just been too busy coding to go back and fix it. As you'll notice, in the actual running window the name is correct. (I got the name because I was listening to Nirvana's "Rape Me" when I came up with the idea!)
spiff · Team-Kodi Member · Posts: 12,706 · Joined: Nov 2003 · Reputation: 129
it's evaluated as a LIFO (last in, first out), i.e. innermost first, so C, then B, then A.
there is a total of 20 buffers and they are global to the scraper parser.
usually these are cleared after executing a function, unless the clearbuffers="no" param is set.
the reason for having this parameter available is that it allows passing info between functions that are executed after each other (i.e. <url function="foo"..> chains).
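a sketch of such a chain (urls, function names and buffer numbers are made up): the first function stashes a value in a global buffer and emits a <url function=..> element; with clearbuffers="no" the chained function can still read that buffer.

```xml
<GetDetails dest="3" clearbuffers="no">
  <!-- stash the id from the page (in $$1) into global buffer $$4 -->
  <RegExp input="$$1" output="\1" dest="4">
    <expression>id=([0-9]+)</expression>
  </RegExp>
  <!-- queue a second fetch; its result is handed to GetCast -->
  <RegExp input="$$1" output="&lt;url function=&quot;GetCast&quot;&gt;http://example.com/cast?id=\1&lt;/url&gt;" dest="3">
    <expression>id=([0-9]+)</expression>
  </RegExp>
</GetDetails>

<GetCast dest="3">
  <!-- $$4 survives from GetDetails because clearbuffers="no" was set -->
  <RegExp input="$$4" output="&lt;id&gt;\1&lt;/id&gt;" dest="3">
    <expression>(.*)</expression>
  </RegExp>
</GetCast>
```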
Posts: 369 · Joined: Nov 2006 · Reputation: 0
2009-04-26, 21:20 (last modified: 2009-04-26, 21:23 by Nicezia)
so if I'm understanding right, there are 20 global buffers (cleared between functions unless specified otherwise), there are 9 buffers available for RegExp captures, and execution of expressions works its way backwards towards the root expression?
Last question I need to ask is about noclean... what exact HTML is stripped if this is NOT set?