custom polish music scraper - help and bug reports wanted!
#1
hey,

I made a hybrid English/Polish scraper for myself.

It mostly reuses code from the existing XBMC scrapers:

allmusic - for generic info [english] - thx spiff
merlin.pl - for album review [polish]
lastfm - for artist biography [polish] - thx spiff

http://smuto.w.interia.pl/allmusic_merlin_lastfm.xml

smuto
#2
:)
#3
hey again,
here is another custom Polish music scraper:
http://smuto.w.interia.pl/xbmcpicard.xml

album search - musicbrainz.org
album review - merlin.pl
album generic info - allmusic.com

artist search - lastfm
artist discography - musicbrainz.org
artist biography - lastfm
artist generic info - allmusic.com

Diacritics are replaced for the allmusic search by my own PHP tool.

smuto
#4
I need to ask:
does XBMC support the MBID when scrobbling a song?

Maybe we could add the MBID to the buffers passed to scrapers?
CreateAlbumSearchUrl: $$1 = album title, $$2 = artist title, $$3 = album mbid
CreateArtistSearchUrl: $$1 = artist title, $$2 = artist mbid

take a look
http://musicbrainz.org/ws/1/artist/$$2?type=xml&name=$$1

And now my problem:
artist = "The The"
With an empty MBID the lookup is
http://musicbrainz.org/ws/1/artist/?type...me=The+The
and with the MBID filled in it is
http://musicbrainz.org/ws/1/artist/a7409...me=The+The

With the MBID available there would be no more problems with auto scan.
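
For anyone who wants to see the idea outside the scraper engine, here is a rough Python sketch of how the URL template above behaves with and without an MBID. It only mirrors the `$$1`/`$$2` substitution; the function name `build_artist_url` and the placeholder MBID are made up for illustration, nothing here is XBMC code.

```python
from urllib.parse import quote_plus

# Template from the post above: http://musicbrainz.org/ws/1/artist/$$2?type=xml&name=$$1
BASE = "http://musicbrainz.org/ws/1/artist/"


def build_artist_url(name, mbid=""):
    """Fill the template: mbid plays the role of $$2, name the role of $$1.

    Hypothetical helper, only meant to illustrate the substitution.
    """
    return f"{BASE}{mbid}?type=xml&name={quote_plus(name)}"


# Empty MBID: the URL falls back to a plain name search.
print(build_artist_url("The The"))
# With an MBID (placeholder value, not a real id): the id is part of the path.
print(build_artist_url("The The", "MBID-GOES-HERE"))
```

With an empty MBID the same template still produces a usable name search, which is why one template can serve both cases.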
smuto
#5
good idea. we do extract them so they should be passable. trac it please.
#6
more questions

Is there a way to fill in the track info
<track><position>\1</position><title>\2</title><duration>\3</duration></track>

from this xml
http://musicbrainz.org/ws/1/release/f2b8...ase-events

position - maybe a repeat index could be used
duration - is given in milliseconds
For now I open a new HTML page to get them.
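
To make the goal concrete, here is a small Python sketch of the transformation I am after: the position is just the running index of each track element, the duration is copied as it is (still milliseconds). The sample XML is a simplified, made-up stand-in (no namespace, invented ids and titles), and `tracks_to_details` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Simplified, invented sample; a real response has more elements and attributes.
SAMPLE = """<release><track-list>
<track id="t1"><title>First</title><duration>246826</duration></track>
<track id="t2"><title>Second &amp; last</title><duration>180000</duration></track>
</track-list></release>"""


def tracks_to_details(xml_text):
    """Turn each <track> into <track><position/><title/><duration/></track>."""
    root = ET.fromstring(xml_text)
    parts = []
    # enumerate() supplies the "repeat index" that serves as the position.
    for position, track in enumerate(root.iter("track"), start=1):
        title = escape(track.findtext("title", default=""))
        duration = track.findtext("duration", default="")
        parts.append(
            f"<track><position>{position}</position>"
            f"<title>{title}</title>"
            f"<duration>{duration}</duration></track>"
        )
    return "".join(parts)


print(tracks_to_details(SAMPLE))
```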

A general request for the allmusic scraper:
can we move the track info from ParseAMGAlbum to GetAMGReview?

smuto
#7
tricky to get sane info from that page. you can most definitely split parseamgalbum if you find it advantageous.
#8
If you are still looking for a way to load track info from XML, you can try something like this:

!!! WARNING !!! I wrote this without much knowledge of your scraper (I only checked it briefly) and without any testing in XBMC, so I guess there are some errors, but in theory it should work (I used similar constructions in my scraper) ;)

Code:
<xxxxxxx clearbuffers="no" .....>
    .
    .
    .

    <!-- Variables:
       - $$20 ... buffer for gathering track infos (new info is appended to end of buffer)
       - $$19 ... current track id
       - $$18 ... current track position
     -->  
    <RegExp input="" output="\1" dest="20">
      <expression/>
    </RegExp>
    <RegExp input="$$1" output="\1" dest="19">
      <expression>&lt;track id=&quot;([0-9a-f\-]+)&quot;</expression>
    </RegExp>
    <RegExp input="1" output="\1" dest="18">
      <expression/>
    </RegExp>
    
    <!-- url call to GetXMLTitleList - empty (cleared) in case there are no tracks -->
    <RegExp input="$$19" output="&lt;url function=&quot;GetXMLTitleList&quot; cache=&quot;yourcache.xml&quot;&gt;http://your.xml.source.url/&lt;/url&gt;" dest="9">
      <expression clear="yes">[0-9a-f\-]+</expression>
    </RegExp>

    <!-- url added to your details buffer -->
    <RegExp input="$$9" output="\1" dest="8+">
      <expression noclean="1"/>
    </RegExp>

    .
    .
    .
  </xxxxxxx>

  <GetXMLTitleList clearbuffers="no" dest="4">
    <RegExp input="$$8$$9" output="&lt;details&gt;\1&lt;/details&gt;" dest="4">
    
      <!-- Parsing track info, appended to $$20 -->
      <RegExp input="$$1" output="\1" dest="6">
        <expression clear="yes" noclean="1">(&lt;track\s+id=&quot;$$19&quot;.*?&lt;/track&gt;)</expression>
      </RegExp>
      <RegExp input="$$6" output="&lt;track&gt;&lt;position&gt;$$18&lt;/position&gt;&lt;title&gt;\1&lt;/title&gt;&lt;duration&gt;\2&lt;/duration&gt;&lt;/track&gt;" dest="7">
        <expression clear="yes">&lt;title&gt;(.*?)&lt;/title&gt;.*?&lt;duration&gt;(\d*?)&lt;/duration&gt;</expression>
      </RegExp>
      <RegExp input="$$7" output="\1" dest="20+">
        <expression noclean="1"/>
      </RegExp>
  
      <!-- Parsing next track id -->
      <RegExp input="$$1" output="\1" dest="19">
        <expression clear="yes">\Q$$6\E\s*&lt;track id=&quot;([0-9a-f\-]+)&quot;</expression>
      </RegExp>
        
      <!-- Track number + 1 -->
      <RegExp input="1-2;2-3;3-4;4-5;5-6;6-7;7-8;8-9;9-10;10-11;11-12;12-13;13-14;14-15;" output="\1" dest="18">
        <expression>$$18-(\d+);</expression>
      </RegExp>      
    
      <!-- Only one variable should be filled
         - $$8 in case there are still unprocessed tracks -> another cycle over GetXMLTitleList
         - $$9 in case you are at the end of list -> returns all gathered track infos from $$20
      -->    
      <RegExp input="~$$19~" output="&lt;url function=&quot;GetXMLTitleList&quot; cache=&quot;yourcache.xml&quot;&gt;Should be already cached&lt;/url&gt;" dest="8">
        <expression clear="yes">~[0-9a-f\-]+~</expression>
      </RegExp>
      <RegExp input="~$$19~" output="$$20" dest="9">
        <expression clear="yes">~~</expression>
      </RegExp>
      
      <expression noclean="1"/>
    </RegExp>
  </GetXMLTitleList>

But i'm not sure if you can call it "sane" :p
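
The "Track number + 1" step is probably the least obvious part, so here is the same lookup-table trick as a Python sketch. Scraper regexes cannot do arithmetic, so the increment is read out of a string of "n-(n+1);" pairs. `next_position` is a made-up name and this only imitates the regex, it is not how XBMC runs it.

```python
import re

# Same table as in the scraper: each entry maps a position to the next one.
TABLE = "1-2;2-3;3-4;4-5;5-6;6-7;7-8;8-9;9-10;10-11;11-12;12-13;13-14;14-15;"


def next_position(current):
    """Imitate the expression $$18-(\\d+); applied to the table string."""
    match = re.search(rf"{re.escape(current)}-(\d+);", TABLE)
    # No entry means the table has run out for this position.
    return match.group(1) if match else None


print(next_position("1"))   # looks up "1-2;"
print(next_position("14"))  # last entry in the table
print(next_position("15"))  # nothing to look up any more
```

The table as written stops at 14-15, so albums with more tracks would need a longer table.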
#9
"sane" means position & duration

The position is just tricky, but I also need to change the format of the duration
from
<duration>246826</duration>
to
<duration>4:07</duration>
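
In ordinary code the conversion is trivial, which is what makes it frustrating in a regex-only scraper. A minimal Python sketch, assuming the value is rounded to the nearest second (that matches 246826 giving 4:07); `format_duration` is just an illustrative name:

```python
def format_duration(ms):
    """Convert milliseconds to m:ss, rounding to the nearest second."""
    total_seconds = (ms + 500) // 1000   # add half a second, then truncate
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


print(format_duration(246826))  # 4:07
```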

thx for your help
#10
it would need code support to allow something ala <duration format=".."></duration>
#11
If you want to pull some stunts you can use this:

Code:
<RegExp input="$$1" output="\1" dest="20">
    <expression>&lt;duration&gt;(\d+)&lt;/duration&gt;</expression>
  </RegExp>

  <!-- 24xxxx minutes+seconds mixed together -->
  <RegExp input="" output="0" dest="19">
    <expression/>
  </RegExp>
  <RegExp input="$$20" output="\1" dest="19">
    <expression>(\d+)\d{4}</expression>
  </RegExp>

  <!-- xx6xxx second seconds digit -->
  <RegExp input="" output="0" dest="18">
    <expression/>
  </RegExp>
  <RegExp input="$$20" output="\1" dest="18">
    <expression>$$19(\d)\d{3}</expression>
  </RegExp>

  <!-- xxx8xx milliseconds part for rounding +0/+1sec -->
  <RegExp input="" output="0" dest="17">
    <expression/>
  </RegExp>
  <RegExp input="$$20" output="1" dest="17">
    <expression>$$19$$18([5-9])\d{2}</expression>
  </RegExp>

  <!-- 24xxxx divided by 6 -> minutes -->
  <RegExp input=";0=0,1,2,3,4,5,;1=6,7,8,9,10,11,;2=12,13,14,15,16,17,;3=18,19,20,21,22,23,;4=24,25,26,27,28,29,;5=30,31,32,33,34,35,;6=36,37,38,39,40,41,;7=42,43,44,45,46,47,;8=48,49,50,51,52,53,;9=54,55,56,57,58,59,;10=60,61,62,63,64,65,;11=66,67,68,69,70,71,;12=72,73,74,75,76,77,;13=78,79,80,81,82,83,;14=84,85,86,87,88,89,;15=90,91,92,93,94,95,;16=96,97,98,99,100," output="\1" dest="16">
    <expression>;(\d+)=(?:\d+,)*?$$19,</expression>
  </RegExp>
  <!-- 24xxxx modulo 6 -> first seconds digit -->
  <RegExp input=";0=0,6,12,18,24,30,36,42,48,54,60,66,72,78,84,90,96,;1=1,7,13,19,25,31,37,43,49,55,61,67,73,79,85,91,97,;2=2,8,14,20,26,32,38,44,50,56,62,68,74,80,86,92,98,;3=3,9,15,21,27,33,39,45,51,57,63,69,75,81,87,93,99,;4=4,10,16,22,28,34,40,46,52,58,64,70,76,82,88,94,100,;5=5,11,17,23,29,35,41,47,53,59,65,71,77,83,89,95," output="\1" dest="15">
    <expression>;(\d+)=(?:\d+,)*?$$19,</expression>
  </RegExp>
  <!-- seconds ($$15$$18) +0/+1 by milliseconds rounding -->  
  <RegExp input=";0=$$15$$18-$$15$$18,;1=00-01,01-02,02-03,03-04,04-05,05-06,06-07,07-08,08-09,09-10,10-11,11-12,12-13,13-14,14-15,15-16,16-17,17-18,18-19,19-20,20-21,21-22,22-23,23-24,24-25,25-26,26-27,27-28,28-29,29-30,30-31,31-32,32-33,33-34,34-35,35-36,36-37,37-38,38-39,39-40,40-41,41-42,42-43,43-44,44-45,45-46,46-47,47-48,48-49,49-50,50-51,51-52,52-53,53-54,54-55,55-56,56-57,57-58,58-59,59-60," output="\1" dest="14">
    <expression>;$$17=(?:\d+-\d+,)*?$$15$$18-(\d+),</expression>
  </RegExp>
  <!-- minutes +1 in case we reached full minute by rounding -->  
  <RegExp input=";60=0-1,1-2,2-3,3-4,4-5,5-6,6-7,7-8,8-9,9-10,10-11,11-12,12-13,13-14,14-15,15-16,16-17,;$$14=$$16-$$16," output="\1" dest="13">
    <expression>;$$14=(?:\d+-\d+,)*?$$16-(\d+),</expression>
  </RegExp>
  <!-- seconds 00 in case we reached full minute by rounding -->  
  <RegExp input=";60=00;$$14=$$14;" output="\1" dest="12">
    <expression>;$$14=(\d+);</expression>
  </RegExp>

  <RegExp input="$$13:$$12" output="\1" dest="10">
    <expression/>
  </RegExp>

But it's even more insane than last time :D
And I guess you can see the drawbacks. Because it depends on the generated number tables, it only works correctly up to the numbers you have in the scraper (in this case the tables go up to 100, so 1009999 = 16:49.999).
From my point of view you should make an official feature request (in Trac) for some kind of duration formatting, because it could be handy for others too.
But feel free to use this piece of code if you want, it works somehow (only briefly tested). I hope I didn't make any stupid mistakes.
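
In case the buffer juggling is hard to follow, here is the same digit-splitting logic written as plain arithmetic in Python, with the buffer numbers from the code above in the comments. It is only my reading of the regexes as a sketch (the name `duration_by_digits` and the `max_tens` parameter are invented here), not something the scraper engine runs.

```python
def duration_by_digits(ms, max_tens=100):
    """Mirror the lookup-table steps: split the number into digit groups."""
    tens = ms // 10000                  # $$19: everything above the last 4 digits
    if tens > max_tens:
        raise ValueError("beyond the generated tables")
    second_digit = (ms // 1000) % 10    # $$18: second digit of the seconds
    round_up = (ms % 1000) >= 500       # $$17: 1 if milliseconds round up
    minutes = tens // 6                 # $$16: the "divided by 6" table
    seconds = (tens % 6) * 10 + second_digit   # $$15$$18: "modulo 6" table + digit
    if round_up:
        seconds += 1                    # $$14: seconds after rounding
    if seconds == 60:                   # $$13 / $$12: carry into the minutes
        minutes += 1
        seconds = 0
    return f"{minutes}:{seconds:02d}"


print(duration_by_digits(246826))  # 4:07
```

Every `//` and `%` here is one of the long number tables in the scraper, which is why the range is limited by how far the tables were generated.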
#12
LOL! ^ this guy's got stamina for sure :)
#13
You can call it curiosity ;)