Kodi Community Forum

Full Version: MusicBrainz & extending the MusicScanner/Scraper
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2 3
Hi,

Noticed MusicBrainz support is sort of implemented with the MB tags cached in CSong. I was taking a look into writing an MB scraper and I noticed that the scrapers only get the Artist/Album as strings. I took a look at the code behind the MusicScanner & MusicScraper and wondered if you would be open to discussion & patches to the music scraper?

I looked at a couple of approaches to propogate MBIDs to the scrapers in the code, but my thought finally was to do the following - open to feedback:
1) I would propose to add the fields MusicBrainzArtistID -> 'artist'/CArtist, MusicBrainzAlbumID & MusicBrainzAlbumArtistID -> 'album'/CAlbum in MusicDatabase->AddSong. While I was writing this I realised this has a secondary advantage - later we can disambiguate similar named Artists/Albums etc using the strArtist & MBArtistID as the key on Artist. But in the first instance the proposal is just to propogate these fields to the Artist/Album to keep it simple. There would be issues around similar named artists but I think those same issues exist today so we're not breaking anything existing.

2) For MusicInfoScanner and MusicDatabase:
My proposal is to pass a CAlbum, CArtist, etc to DownloadAlbumInfo/DownloadArtistInfo and downstream functions instead of fixed strings strAlbum and strArtist. This way any members of the CAlbum & CArtist structs can be passed to scrapers. Later if we extend those structs it is easier to get that data to the scrapers (no refactoring the function params each time). For those source functions calling e.g. DownloadArtistInfo with only a string we'd first create a CArtist, and populate it with all of the data we have in hand. The scrapers could get more of the data available to us in order to make less amiguous decisions. With MBIDs in CArtist & CAlbum as above, we can now easily pass the scanned MBIDs to the scraper.

I have working patches to do 1) and I'm just working on part 2) of the change as a proof of concept. I notice also the MBID tags on my m4as aren't being scanned in right now, I will check that out too.
Finally this would require additional scrapers that key off MBID not 'Artist Name' etc.

I think all of this would be great for fanart.tv which already seems to key of MBID, artist disambiguation, etc.
Wanted your feedback on whether this would be something useful and considered/reviewed for inclusion?
Thanks in advance for looking.
Sounds like a decent plan to me. I think to be completely useful we'd also need to compute/lookup the mbid for songs that don't have it, as my guess is that 99% plus of users would not have that information available in their tags. But ofcourse, that is something separate.
Sure. I have some ideas around that part as well, but I don't want to get ahead of myself right now. If folks are okay with the approach - let me complete the first part and post a link to the git for review/testing soon.

Best Regards.
I haven't read through your ideas but all I know is that olympia (who also writes and maintains the TVDB, TMDB, IMDB and AllMusic scraper) is currently working on a MusicBrainz scraper (or maybe has already finished) it. So maybe you wanna contact him so you guys don't do similar work twice.
(2012-05-28, 09:14)Montellese Wrote: [ -> ]I haven't read through your ideas but all I know is that olympia (who also writes and maintains the TVDB, TMDB, IMDB and AllMusic scraper) is currently working on a MusicBrainz scraper (or maybe has already finished) it. So maybe you wanna contact him so you guys don't do similar work twice.

Artist scraper is working.
Album part is still being worked on.


Being able to store the MBID in in database table(s) (like artist, album group-release or/and release itself, song, record label) would be very useful IMO.
So you get a big +1 from me Smile
(2012-05-28, 09:14)Montellese Wrote: [ -> ]I haven't read through your ideas but all I know is that olympia (who also writes and maintains the TVDB, TMDB, IMDB and AllMusic scraper) is currently working on a MusicBrainz scraper (or maybe has already finished) it. So maybe you wanna contact him so you guys don't do similar work twice.
released today:
MusicBrainz Album Scraper

metadata.album.musicbrainz.org
metadata.common.musicbrainz.org

I've made the first set of changes to the MusicScanner & DB required to get this going.

They're a little kludgier than I would have preferred as I had a couple of misconceptions around how the scrapers & scanners worked and I'm working round a few things. I think if this works out I could offer to work on some housekeeping in the (music) scanner & scraper code to make it a bit cleaner. I have some ideas to add AcoustID support later but I'm thinking the current code might want some TLC first? One thought is to more fully integrate the NfoReader as a proper scraper into the scraper code at GetScraperForPath instead of in the individual Album/Artist downloaders.

Forked repo is here. Open to feedback.
https://github.com/night199uk/xbmc
Scrapers done.
https://github.com/xbmc/xbmc/pull/1027

I'm chatting offline with olympia about whether he would be interested to write the actual scraper code for this. I will start on MBID parsing for m4a/AACs as my music is mostly m4as with MBID tags.

Regards.
The initial patch I've provided should take the first (primary) artist and album artist.

Looks like in NGS Picard can now write multiple artist and album artist tags for e.g. 'Mark Knopfler & Chet Atkins' using their credit system.
This doesn't fit well with the idArtist & strExtraArtists schema currently in use (unless we do something like strMusicBrainzArtistID/strMusicBrainzExtraArtistIDs).
To me the cleanest solution here would be to be able to make the song/album -> artist table links N:N.

I don't want to tread on the whole schema design conversation as that seems to be a huge thread - I don't want to try and boil the ocean.
My thought right now is to evaluate the work / pitfalls of converting the Album/Song -> Artist table mappings to an N:N mapping, and also importing/supporting things like the Compilation flag on MP3s for identifying Various Artist releases more easily.

My next step is to get the primary MBID data imported from M4As though.
They already are N:N. The strExtraArtists is only there to speed things up for the UI (denormalisation for speed). See the link tables exartistsong/album.
sure - exartistinfo seems to serve a different purpose though. we only take the first artist from the song and create an artistinfo. we create everyone else in the exartistinfo table. if i had to hazard a guess i would say this was added on later so as not to disturb the existing artist relationship but show featuring credits. it also seems to drive the 'Don't show artists that only appear on compilations' tag at a guess?

my thought is removing 'idArtist' from Song and 'idAlbumArtist' from albums and promoting them to link tables. in the 'Mark Knopfler & Chet Atkins' example above Chet Atkins isn't a featured artist, he's a primary artist but i don't think we can support that today. for the 'don't show artists that only appear on compilations' tag my thought would be to use the 'compilation' tag in the songs. i don't think we write that to the db today but i feel we should. this is basically how musicbrainz works. iTunes also sorts its music database this way and with musicbrainz and itunes today the workflow is very natural (to me) - i know not everyone uses the same workflow though. i would re-introduce the 'strArtist' tag read from the file itself as this is the 'genuine' names of the artists as they are presented to the user, e.g. 'Mark Knopfler & Chet Atkins featuring Someone'. the individual artist tags just help to index the tune correctly to show under the relevant artists.

i think that's a side issue though - as with the discussion on github the first challenge seems to be that the we're heavily dependant on the std::vector<string> in CAlbum, etc. this means we only ever use the artist name as the key into the artist table and forego the quality 'idArtist' unique we get when we create the song. :-( I guess it's historical. i think there is some foundation work i propose to do here first to get non-unique named artists & albums to work. otherwise when we introduce MBIDs onto the tables there is a risk of associating MBIDs with the wrong artist/albums:

- in the first instance i would create accessors on CAlbum/CArtist/CSong for artist, album, etc. std::vector<string> GetArtists / SetArtists(std::vector<string> artist).
- replace everywhere in the code we do a StringUtils::JoinString(album.artist, g_musicseparator) with a call to the accessor
- change all the fields from std::vector<string> to std::vector<long> e.g. std::vector<string> artist in CAlbum

this change in itself has no effect on behaviour day 1. the accessors will deal with resolving the key to the artist string. this means some db related code in the CArtist accessors :-( an alternative would be to create an std::vector<string> GetArtistsByID(std::vector<long> artist) in MusicDatabase. I prefer the idea of field accessors personally, and we could remove the DB calls from CArtist/CAlbum later when we get to our destination.

with the accessors in place we can start to work through the functions that call the accessors - if they are internal code that could be converted to use the idArtist correctly then i'll convert them to look up the artist by the ID. if they are serialization or presentation code and genuinely need the artist name etc - then they shouldn't change but we can start to change the accessors to add disambiguation information from CArtistInfo/CAlbumInfo to the generated strings to make the string based lookups more unique.

thoughts? my approach is to take this in small easy digestable chunks that can easily be PRd and integrated. i've seen a bunch of 'redesign the world' schema threads that have been aborted and my thought is not to go that route.
CArtist/CAlbum are go betweens for the UI. You do not want any database stuff in them at all.

CArtist should need only hold a single id, right? Unless we're treating "Mark Knopfler & Chet Atkins" as a collection of artists that would somehow be represented as a single CArtist with separate ids? How does MusicBrainz work in this regard? Do we need an artist ID <-> artist ID link table as well?

Perhaps the best method, then, is for CAlbum (and CSong) to hold a vector< pair<int, string> > for artists/album artists. Using accessors to get the full string seems good to me.

Given this, we'd need to redo the db layout. The first step would be to switch exartistsong (et. al.) to be all artists for the song, rather than just the extra artists - this would simplify the logic a little on fetching songs by artist I think.

Then you can drop idArtist/idAlbumArtist from CSong/CAlbum and replace the strExtraArtists stuff with a denormalised form of the vector< pair<int, string> > for speed.

The MBID stuff needs to be in the appropriate table as well (i.e. no artist/album info for musicbrainz in the song table) - we don't need this at all in the UI imo - it should not be required to be retrieved at any point for the UI, though I have no problem with the song-specific stuff being retrieved in the song queries.

This is quite a bit of work, but I think it's the only way it's really going to work out well, and I don't think it negates any move to another db layout later on - after all, the current musicdb layout is pretty good other than this.

And yes, please keep commits small and self-contained - let's achieve it slowly but surely Smile

Cheers,
Jonathan
I think we're agreed. :-) I have a much better patch coming through now. I went a slightly different way and created a new table album_artist rather than trying to reuse exartistinfo. I'm gradually building out new and then I will demise the old in the patch so that I keep everything working at each stage of the patch. Much smaller commits too now I have the hang of git.

Regarding album->artist this is what I found - CAlbum (today) stores artist as std::vector-string-, but I cannot find anywhere in the code where we actually really support multiple artists except the DB, XML and JSON APIs. None of the tag readers support reading multiple artist tags. So my argument is that album->artist should really be converted back an CStdString, and this would greatly simplify the code. I propose that artist credits are really just a database side things for indexing (so when I hit 'Mark Knopfler' I see all tracks where 'Mark Knopfler' contributed). Everywhere else we need to use album->artist as a string, to get the artists credits as originally written on the track. This is basically how MusicBrainz works. We store a string for the label and tag the string with the relevant artists. Here's a good example: http://musicbrainz.org/artist/e49f69da-1...dcb0e588f5

So basically everywhere in the code today where we do:
StringUtils::Join(album.artist, g_musicitemseparator) should use strAlbum (the original tag read from the file).

Everywhere we use album.artist in a vector context should use a new DB function:
std::vector_long_ GetArtistsForAlbum(idAlbum, bool includeFeatured)

Those places will be fixed to use the ID and lookup the artists by ID rather than String as they do right now.

Now I will write one master 'artist string processor' for non-MB users which deals with as many of the type of featuring credits as I can find:
'Mark Knopfler & Chet Atkins feat. Dizzee Rascal & Emmy Lou Harris' or whatever :-)

And that will be called from AddSong (and temporarily the XML & JSON APIs to maintain compatibility) to create the relevant entities in the DB. If we find MusicBrainz tags in the file we will skip doing this step and just create the artist entities with the MBIDs. The one down side of this is that we would create a bunch of artists with empty strArtist but MBIDs and then call the MB scraper later. I don't think I'm concerned about this on reflection as the scraper runs in the InfoScanner right after we pull in the file... MBIDs will be more reliable than our artist string processor, as sometimes we get a single Artist or Group that has an & in the name ('Myon & Shane 54'). Later I propose to add an advanced settings tag that will let you specify a list of known artists that have an '&' in their name to override the artist string processor for non-MusicBrainz users.

Regards MusicBrainz, 'Mark Knopfler & Chet Atkins' are two artists with two IDs, but the album contains the string 'Mark Knopfler & Chet Atkins'.

See here:
http://musicbrainz.org/artist/e49f69da-1...dcb0e588f5

I'm going to pull out the original MB PR I raised and raise a new PR - apologies for the spam. I also want to move onto a branch in my git repository and I think this is a very different request now.

Once this is done for albums I will start to look at song_artist... :-)
We completely support multiple artist tags, but typically it's done via a separator in a single tag (eg " / " is the default one). Some of these tag formats support a list of strings for these fields, though typically it's a single string that's split using a separator. We don't really utilise it in any way other than to split the artists up for the db, so your plan sounds fine.

I'm not sure what the point of the new table is, but perhaps it will become clear once I see the code Wink

Cheers,
Jonathan
Problem with the splitting of artist is you can only use one seperator. At least supporting multiple would be better Smile
Pages: 1 2 3