Looking for the ultimate HTTP fetch function
#16
hi coders!

i'm trying to code my first script. i usually code some php and mysql-

i never did anything in python before, but now i have the "basics": i made a script with some controls and navigation to learn the technique.

now i want to pull some content out of html pages and i'm really getting stuck with this. i've tried for many days now, reading some tutorials i found, but no success, everything i try doesn't work. :verysad:

regular expressions are so difficult! how did you learn them? any hints, especially on html parsing?

are there any other ways to find my strings in html code than regexp?

cu, a surrendering lolol Sad
#17
regexbuddy... http://www.regexbuddy.com/

without this, i would never be able to get anywhere!! it allows you to deal with regex strings logically and piece by piece. then, it exports the regex to whatever code you want (python, etc...)

i too am fighting with python and html parsing though. i've got the regex set up to find the right code, but now i'm stuck knee deep in passing info to and from forms and cookies.

if anyone out there feels benevolent, a useful tutorial on how to grab something from an html page, and how to send info back to a page, would be incredibly helpful.

thanks all...
#18
phunck... the cachemanager is awesome. i for one would deeply love some basic cookie support. it's the one thing i can't seem to master for the mlb.tv script i'm writing. i'm trying to use the clientcookie python addon and it's just killing me.

if anyone has any ideas or shortcuts: i'm trying to log into a site that uses cookies. i've been trying to learn via the nrkbrowser, but i'm just not talented enough (yet!! Wink) to figure it out.
#19
thank you!

hope the buddy will help me, i'll look at it!

what about linebreaks in html? are they ignored? i don't mean the <br>'s, i mean the non-visible breaks in the source.

how do i handle them?

glad i'm not alone, lolol!
#20
my method of html parsing makes sense to me and might help you out. my strategy is to match the beginning of a tag, allow any number of characters that do not match the closing character, then match the closing character. so if i wanted to get an address out of an <a href="http://something.com"> tag i'd do...

Quote:p = re.compile('<a href="([^"]*)">', re.IGNORECASE)
result = p.findall(data)

[^"] is any character that isnt "
[^"]* is any number of characters that arent "
[^"]*" is any number of characters that arent " followed by a "
"([^"]*)" allows me to extract the part in parenthesis (ie all the non quote characters between the quotes) which is the part i want

^ means not,
* is 0 or more occurances,
? is 0 to 1 occurances
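
if you want to play with that on your pc first, here is a minimal self-contained version (the sample html is made up, of course):

Quote:import re

# made-up sample html, just for illustration
data = '<p>see <a href="http://something.com">this</a> and <a href="http://other.com">that</a></p>'

p = re.compile('<a href="([^"]*)">', re.IGNORECASE)
result = p.findall(data)
print result   # ['http://something.com', 'http://other.com']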

that's all i use to build the following reg-ex, which i use to parse the fields out of a table from yahoo finance. again, this probably isn't the most efficient way to do it, but it makes the most sense to me.

Quote:p = re.compile('<a href="/q\?s=([^"]*)">[^<]*</a></b></td><td [^>]*>([^<]*)</td><td [^>]*><b>([^<]*)</b></td><td [^>]*>(<img [^>]*>)? ?<b[^>]*>([^<]*)</b></td><td [^>]*><b[^>]*>([^<]*)</b></td><td [^>]*>([^<]*)</td>', re.IGNORECASE)
stocklines = p.findall(data)
stockdata[:] = []
# groups: 1 = name, 2 = date, 3 = price, 4 = up/down image, 5 = difference, 6 = perdiff, 7 = volume
for line in stocklines:
    s = stock(line[0])   # 'stock' is a small data class defined elsewhere in my script
    s.date = line[1]
    s.price = line[2]
    s.change = line[4]
    s.perchange = line[5]
    s.volume = line[6]
    stockdata.append(s)

i hope that helps.
#21
this is getting a bit off-topic, but this thread has to do with fetching data, so..

using clientcookie is just about the easiest thing in the world, as its cookie handling is fully automated. to open a site you use it just like urllib:

data = clientcookie.urlopen("http://www.test.com").read()

if that page sets any cookies, clientcookie will automatically store them.

to log in to a page you need to check the form (the places where you can enter information) by looking at the site's source. look at def login() in nrkbrowser. you need to send data back to the server based on the form the site uses. to do this you create a dictionary with the data, and then use urllib.urlencode(yourdictionary).

ex:

form = {
    "login" : yourlogin,
    "password" : yourpassword
}

this would work on a form where the names/id's of the fields where you enter information are login and password.

you then open the page which the submit button "sends the data" to.
this could be in the html source:
<form method="post" action="login.asp">
you see that it makes the browser post the content to login.asp. we then do exactly that:

clientcookie.urlopen("http://www.test.com/login.asp", urllib.urlencode(form))
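
putting the whole thing together, a rough self-contained sketch (the site, the form field names and the members page are all made up, and depending on your install the module may have to be spelled ClientCookie):

Quote:import urllib
import ClientCookie

# made-up login form -- use the field names from the site's own <form> source
form = {
    "login"    : "yourlogin",
    "password" : "yourpassword"
}

# post the form to the page named in the form's action attribute;
# any cookies the site sets are stored automatically
data = ClientCookie.urlopen("http://www.test.com/login.asp", urllib.urlencode(form)).read()

# later requests through ClientCookie.urlopen() send those cookies back
page = ClientCookie.urlopen("http://www.test.com/members.asp").read()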

i hope this made some sense...
xbmcscripts.com administrator
#22
i might as well add some more to this, but this will be more general text related. what asteron said will be fine for links afaik (i do things differently, but i doubt that it's better).

if you want all of the text in an area you can use a dot, . , which matches any character. so, if you want, say, a large piece of text from a site you can do re.search('<div>(.*)</div>', data, re.IGNORECASE | re.DOTALL) .

now let's see what this does. it searches for a <div>; when that's found it will grab all the text in between until it gets to </div>. notice the re.DOTALL. normally the dot doesn't match linebreaks, but DOTALL makes it take all text, linebreaks and all, until it hits the end of the expression. the IGNORECASE of course just lets you ignore the case of the expression you are searching for; that means it would also accept e.g. <DIV>. might come in handy sometime... i guess a full guide should be added to xbmcscripts.com sometime, so please send me any examples you've got.
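
a quick runnable example of that (the html is made up):

Quote:import re

# made-up html spread over several lines
data = """<html><body>
<div>first line
second line</div>
</body></html>"""

m = re.search('<div>(.*)</div>', data, re.IGNORECASE | re.DOTALL)
if m:
    print m.group(1)   # prints both lines, newline included
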
xbmcscripts.com administrator
#23
my technique for writing an re is to start with a very simple one, which usually yields too many matches.
then i refine it step by step, adding one pattern element at a time and doing a test run.

i always recommend testing on your pc, using either a simple test script or the xbmc emulator.

when parsing html there are some things to keep in mind:

-try to make your re independent of html formatting
web designers tend to change the layout of their pages quite often. if you include formatting elements like sizes, colors or font names in your re, it is quite likely your re, and with it your script, will fail the next time the web designer changes the layout.
this is not always possible, but sometimes you can add a few more wildcards to prevent it.

-watch out for the greedy .* pattern
most re samples use .* as the typical wildcard. when parsing html you usually want to find tags like:
 <a href=".*">.*</a>
but the .* pattern is greedy, so the re will match all the way to the very last closing </a> tag in the html document.
by using the ? modifier you tell the re that it should not be greedy, i.e. it should stop at the first match.
so the re above should be rewritten as:
 <a href=".*?">.*?</a>

hope it helps
bernd
#24
hi!

yeah, thx guys for your hints. it's motivating me a lot.

i really needed these tips!

i'm searching around the web for some easy python regexp tutorials; i'll post here when i find anything helpful.

cu lolol
#25
Thumbs Up 
i played around with phunck's cachemanager.
it works great. so let's check it against the requirements:

-timeout for connecting
yes! Nod

-timeout for reading data (the function should return if the server hasn't sent any data for a specified amount of time)
couldn't test how it reacts when the server starts and then stops sending data. can anybody point me to an unreliable server?
but i tested it the other way: set the timeout to 1 sec and downloaded a large file, and the timeout didn't fire.
did some code review and would say: yes! Nod

-user agent
yes!
tested against http://webtips.dan.info/cgi-bin/browser.pl and it works! Nod

-some kind of progress callback so ui-updates/progress can be done
yes! Smile but could be more flexible!
currently cachemanager.py has a progressdialog built in.
i think it would be great to make this more customizable. maybe a member function ondataretrieved(percentage, bytesreceived, bytestotal) which can be overloaded to customize the progress. i just don't want to force the script writer to have a progress dialog.
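
just to illustrate the idea (the method name and arguments here are only a suggestion, not anything that exists in cachemanager.py today):

Quote:# hypothetical sketch of the suggestion -- not the real cachemanager.py api
class cachemanager:
    def ondataretrieved(self, percentage, bytesreceived, bytestotal):
        pass   # default: no ui at all

# a script that wants a progress dialog (or just print statements) overloads it
class mycachemanager(cachemanager):
    def ondataretrieved(self, percentage, bytesreceived, bytestotal):
        print "got %d of %d bytes (%d%%)" % (bytesreceived, bytestotal, percentage)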

other things i found while testing:
-currently there is no method to clear the entire cache.
you can remove one url from the cache but not all?!
in your sample you had a method, but i couldn't find that in the version i downloaded. maybe it just got lost during 'reorganisation'

-would be great if one didn't have to call createfolders explicitly and the class did it automatically.

-maybe the class could be renamed to something more http related like 'cachedhttp', 'httputils', 'fetchex' or <yoursuggestionhere>
it's just that cachemanager doesn't instantly bring http/fetch functionality to my mind.


here is a simple example of how to use it:
Quote:import cachemanager

cm = cachemanager.cachemanager()
cm.createfolders()
cm.setuseragent('myuseragent') # optional

# simply return the contents of the url as a string
# timeouts can be set using setsockettimeout()
html = cm.urlopen('http://www.stfu.se')
print html

# download to a file in the cache. returns the full path in the cache
# (downloading to a specific location is also possible)
url = 'http://pictures.xbox-scene.com/hsdemonz/smartxxsolderlesadapter/adapterfronthi.jpg'
filename = cm.urlretrieve(url)
print filename

at least from my point of view phunck's cachemanager.py could be the ultimate http fetch class.
what do you as a script writer think? please make


great work phunck  :kickass:

btw: phunck your htmlscrub.py is also quite useful for any script handling html
#26
thanks for the feedback, bernd Smile

Quote:-maybe the class could be renamed to something more http related like 'cachedhttp', 'httputils', 'fetchex' or <yoursuggestionhere>
it's just that cachemanager doesn't instantly bring http/fetch functionality to my mind.
yes, i think the class should be renamed (maybe to httpfetcher). "cachemanager" is not really very descriptive.

Quote:currently there is no method to clear the entire cache.
yes, there should be some way of doing that. there is a .cleancache() method, but that only expires old data. maybe it could take an optional argument that expires everything older than xxx; you could then set that to 0?


Quote:currently cachemanager.py has a progressdialog built in.
i think it would be great to make this more customizable. maybe a member function ondataretrieved(percentage, bytesreceived, bytestotal) which can be overloaded to customize the progress. i just don't want to force the script writer to have a progress dialog.
i agree... i think it would be nice with a callback, or if you could pass in the progress dialog. but i would also like the current behaviour to stay the default, so that it is easy if you don't have more complicated requirements.

Quote:-would be great if one didn't have to call createfolders explicitly and the class did it automatically.
yes, but then the folder paths should be specified in the constructor... i'll probably make it like that.


Quote:timeout for reading data (the function should return if the server hasn't sent any data for a specified amount of time)
couldn't test how it reacts when the server starts and then stops sending data. can anybody point me to an unreliable server?
i've seen it time out in the middle of a download (comics.com is quite unreliable), so the socket timeout apparently works for both kinds of timeouts.

------------------
a todo list:
* cookies (unfortunately cookielib isn't in python 2.3). the last example here would be extremely easy to implement in my code: http://docs.python.org/lib/cookielib-examples.html (see the sketch after this list)
* flexible progressbar method
* auto create folders
* allow post data
* a method to force a cached retrieve even if server says that it is time to get a new one.
* debug what happens if you say download permanently and you already have it in the cache. (i have a feeling there is a bug).
* figure out a way of distributing this. (i'm hoping for something like scriptrunner to implement some functionality that ensures pil installation and things like this. *hint* Smile http://www.xboxmediaplayer.de/cgi-bin....t=12217 )
* should all scripts share the same cache? it sort of makes sense, except 2 scripts wouldn't visit the same site anyway (probably).
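
for reference, the basic cookielib pattern looks roughly like this (it needs python 2.4, so it's more a note for the future than something xbmc's 2.3 can run today):

Quote:import cookielib, urllib2

# a cookiejar keeps the cookies between requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# cookies set by this response are stored in cj and sent back on later opens
data = opener.open("http://www.example.com/").read()
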
#27
i think that regex buddy looks great. (i'm testing my own regexps using ultraedit, but that is not optimal)

i suppose you know this link, but here it is anyway. it is not for learning but it is a very useful reference. especially since python regexps are more powerful than those in many other languages.
http://docs.python.org/lib/re-syntax.html
#28
Quote:i agree... i think it would be nice with a callback, or if you could pass in the progress dialog. but i would also like the current behaviour to stay the default, so that it is easy if you don't have more complicated requirements.
my intention was to keep gui and html stuff separated. i like to develop and test most of my stuff on the pc without xbmc. but maybe that's just me.
wouldn't an overloadable ondataretrieved function be the more python way? this way the user can also customize the message etc.
but no need to discuss this minor point at length here.

Quote:yes, but then the folder paths should be specified in the constructor... i'll probably make it like that.
i don't think it has to be passed in the c'tor, but the default in the c'tor should be the cache folder for all scripts.

Quote:i've seen it time out in the middle of a download (comics.com is quite unreliable), so the socket timeout apparently works for both kinds of timeouts.
this socket timeout thing is a very elegant solution, because it doesn't require any threads!
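
for anyone curious, the thread-free way to get this in plain python (and presumably roughly what cachemanager does internally, though i haven't checked) is the global socket timeout:

Quote:import socket, urllib

# one global setting; every socket created afterwards inherits the timeout
socket.setdefaulttimeout(10)

try:
    data = urllib.urlopen('http://www.example.com/').read()
except socket.timeout:
    print 'server stopped sending data for more than 10 seconds'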

Quote:* should all scripts share the same cache? it sort of makes sense, except 2 scripts wouldn't visit the same site anyway (probably).
i also think it would make sense (see above)
btw can't the cache folder be placed somewhere on x:, y: or z:? aren't these partitions for temp file use?

bernd
#29
Quote:can't the cache folder be placed somewhere on x:, y: or z:? aren't these partitions for temp file use?
they are for cache, but i wonder if it is only for game cache. does anybody have any experience with this? but a general logical location like that would be good. but maybe just q:\scripts\httpfetchcache\ ?

the permanent folder should probably be specified on a script by script basis. e.g. sometimes f:\videos\musicvideos might make sense.

Quote:my intention was to keep gui and html stuff separated. i like to develop and test most of my stuff on the pc without xbmc. but maybe that's just me.
wouldn't an overloadable ondataretrieved function be the more python way?
that is a nice way of doing it! i'll do that.

btw, i also test on the pc using alexpoet's emulator. i just added these lines (to xbmcgui.py) to make it work:
Quote:class dialogprogress:
    def create(self, title, line1, line2 = "", line3 = ""):
        print('xbmcgui.progress.created:' + line1)
    def update(self, pct):
        print('xbmcgui.progress:' + str(pct))
    def iscanceled(self):
        return False
    def close(self):
        pass
#30
enderw: you mentioned clientcookie. i can see that this is equivalent to cookielib in py2.4, so this is what i need. however, can i rely on clientcookie being installed?
