A little trick for website scraping with Linux

linuxluemmel
Post: #1
Hello to all ;-)

At the moment I am writing an EPG scraper for Linux.
After many hours of working on a mirrored website with tools
like sed, awk and grep, I found a little trick that makes scraping
with Linux an easy job.

- I use snarf to download an entire website to my local computer

snarf -m -q http://tv.search.ch searchlocal 2>/dev/null &

My first attempt worked directly on the HTML code, without great success.
My first bash script looked like this:

Code:
grep -A 3 Titel: tmp.html | tail -1  | sed 's/[ \t]*$//' | sed -e 's/^[ \t]*//' | sed "s/^ *//;s/ *$//;s/ \{1,\}/ /g" | sed "s/<b>\|<\/b>//g" | sed "s/<\/td>//g" | sed 's/- Laufzeit/\n/g' | head -1
grep -A 10 Inhalt: tmp.html | grep -m 1 -A 15 " >" | head -10 | sed '1d' | sed -e 's/^[ \t]*//' | sed "s/<b>\|<\/b>\|<td>\|<tr>\|<br \/>\|<\/td>\|<\/tr>//g" | sed '/^</ d' | sed "s/^ *//;s/ *$//;s/ \{1,\}/ /g"
grep -A 3 'Genre:' tmp.html | tail -1 | sed -e 's/^[ \t]*//' | sed "s/<b>\|<\/b>\|<td>\|<tr>\|<br \/>\|<\/td>\|<\/tr>//g"
echo regie && grep 'Regie:' tmp.html | sed -e 's/^[ \t]*//' | sed 's/search="/\n/g' | tail -1 | sed 's/">/\n/g' | grep '^<' | tail -1

This code was not easy to change.
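
Part of the difficulty is that every line of the script repeats the same tag-stripping and whitespace-trimming sed calls. A minimal sketch of folding them into one helper could look like the following. The name clean_html_line is hypothetical, and removing every tag is an assumption: the script above only removes a fixed set of tags.

```shell
#!/bin/sh
# Hypothetical helper: read HTML lines on stdin and print plain text.
# Assumption: it is acceptable to drop *all* tags, not only <b>, <td>, <tr>, <br />.
clean_html_line() {
    sed -e 's/<[^>]*>//g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e 's/[[:space:]]\{1,\}/ /g'
    # 1st expression: remove anything that looks like a tag
    # 2nd and 3rd:    trim leading and trailing whitespace
    # 4th:            collapse runs of whitespace into one space
}

# Possible use, mirroring the Genre line of the script above:
#   grep -A 3 'Genre:' tmp.html | tail -1 | clean_html_line
```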

But I found a solution that was right for my job: lynx.
It has a so-called dump feature (-dump) that renders the web page as plain text. The output is printed to the terminal or, when redirected, stored in a text file.

- An already formatted text file can easily be searched with tools like grep
- I do not have to handle line breaks or other HTML-related things

lynx -dump "$line" > tmp.html

All the text, as it would appear in a GUI browser, is now inside the text file
tmp.html

- With all formatting, line breaks and everything
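
Searching the dumped text could then be as simple as the sketch below. The helper name field_after_label is hypothetical, and the layout is an assumption: I assume here that a value either follows its label on the same line or sits on the next non-empty line. The real layout of the dump has to be checked against an actual page.

```shell
#!/bin/sh
# Hypothetical helper: print the value that belongs to a label such as "Titel:".
# Assumption: the value is on the label's line, or on the next non-empty line.
field_after_label() {
    label=$1
    file=$2
    awk -v label="$label" '
        # Strip leading and trailing blanks and tabs from a string.
        function trim(s) { sub(/^[ \t]+/, "", s); sub(/[ \t]+$/, "", s); return s }
        # Label was seen on an earlier line: first non-empty line is the value.
        found && NF { print trim($0); exit }
        # Label is on this line: use the rest of the line if there is any.
        (i = index($0, label)) {
            rest = trim(substr($0, i + length(label)))
            if (rest != "") { print rest; exit }
            found = 1
        }
    ' "$file"
}

# Possible use after the lynx call above:
#   field_after_label 'Titel:' tmp.html
#   field_after_label 'Genre:' tmp.html
```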



Regards from Switzerland
Hans
(This post was last modified: 2010-05-25 22:45 by linuxluemmel.)
Buttink
Post: #2
That's so hardcore it's kinda scary