[RFC] Database Schema 4.0 - Request for Comments (Developers and DB Admins only!)
#61
smashmouse Wrote:Been using xbmc for a bit and thought I'd see if there is anything I can do to help with future evolution.

Regarding db schema design, i really wouldn't recommend using a EAV schema to model the whole logical data model. I think it's ok to model flexible attribute sets that maybe only used for display purposes. Maintenance and query design of such a schema will be a large overhead.

Has there been any thought of using a star schema to provide dimensional model of the data?

I've also been pushing back on a pure EAV/CR design, and firnsy agreed that a hybrid using explicit tables (e.g., movie) with flexible attribute sets attached to them as necessary (each table has a key and "extra" attributes can be attached via the table key and the table's own primary key) is a good compromise. My main issues with EAV/CR is that it breaks relational integrity (non-null columns, data type enforcement, indexes, foreign keys) and that it unnecessarily creates a database within a database and turns the application into a mini-DBMS, if you use it for columns that establish relationships, need sorting, etc. It's less bad as you say if they're only "display" columns (but still extra query overhead and lacks data type enforcement). I'd prefer to avoid EAV/CR altogether if there's no compelling case for it.

Can you explain more about where you think dimension tables would be useful? For the video database nullable columns seem sufficient for information that may not be available, and there are already separate tables for various relationships.
Reply
#62
dbrobins Wrote:Can you explain more about where you think dimension tables would be useful? For the video database nullable columns seem sufficient for information that may not be available, and there are already separate tables for various relationships.

Well, first looking at the requirements for use cases for XBMC.
1. The majority of queries is to return sets of data. E.g. Get all movies, Get All Movies that have genre of Action, Get All Movies that have actor = xyz etc.
2. For XBMC operation, majority of queries are selects. Due to the nature of the application, these need to be fast and snappy to not slow down the GUI.

A well defined dimensional model allows fast retrieval of data sets, drilling down on the data by the dimensions required. e.g. a dimension for party, e.g. Jessica Alba, a dimension for party role, e.g. Actor, a dimension for genre. The dimensions would be built with their own integer surrogate keys and have the relationships between them using fact tables, some of which would be factless facts (no logical data, just joining of dimensions). Data can be retrieved using fast indexes on the integer surrogate keys, only using dimensions that are required. OLTP schemas are not good for drilling down on data and the EAV schema is even worse due to being restricted to attribute/value pairs, where multiple dimensions would need to be modeled using over complex relationships.

This is how many datawarehouses are built when using architect design patterns set out by Ralph Kimball and used in the organisation I work for my day job. This is able to scan billions of records of data very fast, however this is running on a Netezza appliance.

There was a post stating that that there is pros and cons to the EAV schema. Flexibility is a pro, but query performance is a con. Query performance must be up high on the priority list for an application like XBMC. EAV requires modelling different entities in the same table space. The flexible schema design requires attributes to be define as data rather than column names. This will result in very inefficient indexes, having to index strings in large record sets of which there will be many duplicates and many will be discarded from the result sets where entities are not required.

My design principles is you can't design for every possible future eventuality, you can only design well for the day with what is known on the horizon. We shouldn't adopt a schema that doesn't model the data entities, this is shying away from the problem, and hiding it in a hard to maintain schema. As long as the schema is extensible to add to in the future without changing what is designed today, that should suffice for good design.
Reply
#63
smashmouse Wrote:
dbrobins Wrote:Can you explain more about where you think dimension tables would be useful?
Well, first looking at the requirements for use cases for XBMC.
1. The majority of queries is to return sets of data. E.g. Get all movies, Get All Movies that have genre of Action, Get All Movies that have actor = xyz etc.
2. For XBMC operation, majority of queries are selects. Due to the nature of the application, these need to be fast and snappy to not slow down the GUI.

A well defined dimensional model allows fast retrieval of data sets, drilling down on the data by the dimensions required. e.g. a dimension for party, e.g. Jessica Alba, a dimension for party role, e.g. Actor, a dimension for genre. The dimensions would be built with their own integer surrogate keys and have the relationships between them using fact tables, some of which would be factless facts (no logical data, just joining of dimensions). Data can be retrieved using fast indexes on the integer surrogate keys, only using dimensions that are required. OLTP schemas are not good for drilling down on data and the EAV schema is even worse due to being restricted to attribute/value pairs, where multiple dimensions would need to be modeled using over complex relationships.

How much have you looked at the current working schema? There already is an actors table, and a table that links it (many-to-many) to the movie table (actorlinkmovie). And there are other similar tables and relationships, although without explicit foreign keys.

Quote:
There was a post stating that that there is pros and cons to the EAV schema. Flexibility is a pro, but query performance is a con. Query performance must be up high on the priority list for an application like XBMC. EAV requires modelling different entities in the same table space. The flexible schema design requires attributes to be define as data rather than column names. This will result in very inefficient indexes, having to index strings in large record sets of which there will be many duplicates and many will be discarded from the result sets where entities are not required.

My design principles is you can't design for every possible future eventuality, you can only design well for the day with what is known on the horizon. We shouldn't adopt a schema that doesn't model the data entities, this is shying away from the problem, and hiding it in a hard to maintain schema. As long as the schema is extensible to add to in the future without changing what is designed today, that should suffice for good design.

I've made those exact same points here and offline: that EAV defeats indexes (and foreign key constraints, and even the simple requirement that a column be present and non-null), and that an EAV schema not only defeats those things which databases are good at, but also effectively reinvents a database in the application and makes it do work that a database does far better. It's not less work; it's just pushing the work to a place less suited.

I have put some schemas on the wiki - they're a variant/cross between the current schema and firnsy's earlier (less radical, if I may; less fully EAV) schema, with some incremental fixes like column renames and addition of foreign key constraints. I also started a page discussing design and relation between files and content. The approach there is top-down: start with the application needs (objects), and from there determine the database requirements. I may go ahead and see if I can make a patch that does some incremental changes, as firnsy encouraged earlier, e.g., column renames ('c##' to something more accurate), create missing foreign keys, etc.
Reply
#64
Quote:How much have you looked at the current working schema? There already is an actors table, and a table that links it (many-to-many) to the movie table (actorlinkmovie). And there are other similar tables and relationships, although without explicit foreign keys.

I have looked at the current schema and there also other link tables for managing the many to many relationship between actor and tvshows, musicvideos, epsiode etc. This would be much more efficient in a dimensional model. And would allow more simple queries in some use cases. e.g. selecting an actor from a movie could easily show where else the party has appeared in the library. e.g. an actor that may also be a director for a different movie or a music artist who also stars in a movie.

I had a look at the schemas proposed and they're certainly a step in the right direction for providing a level of abstraction.

It may be best to model the data first before trying to provide a physical schema. I need to get up to speed more with the current schema and data model but i may try and put a draft model of entities together as starting point, to detemine the level of abstraction and generalisation to create the concrete objects.
Reply
#65
I've been reading the proposed changes and am curious if there is any proposal on changes to the Class/Business layer? There can be changes made there that can help.
Reply
#66
Just wondering if there are any updates on the proposed EAV/CR schema.

I'd be willing to offer up some help if possible. I have 10+ experience with working with Oracle so I have a good grasp of data modeling and writing SQL.

Any help required with script writing, DDL, cleanup, etc?

**It would be great to see the Express versions of Oracle & SQL Server supported as well**
Reply
#67
I think the watched flag has to be within its own table, something like:

watched:
-idFile
-playCount
-lastPlayed

Another possible solution would be:

watchhistory:
-idFile
-date

This has several andvantages. No space is wasted for files that weren't watched yet. The second solution would also provide the ability to output some sort of watch history.

Furthermore it would be more easy to make the watch flag profile based, whereas all other meta information could be shared by all users/profiles. This isn't possible right now Sad.
Reply
#68
Question 
Regarding a database schema for music and music-videos, why not just start by using MusicBrainz's newly released NGS (Next Generation Schema) as is?

http://wiki.musicbrainz.org/Next_Generation_Schema

MusicBrainz Next Generation Schema (NGS) has been announced as a complete re-write of MusicBrainz database schema model and it was just announced a couple of weeks ago.

http://wiki.musicbrainz.org/?title=Serve...s/20110516

"MusicBrainz Next Generation Schema (NGS) is the culmination of five years of planning and two years of active development".

Also checkout MusicBrainz for music data mining in general because it contains loads of cool release meta data information like advanced relationship and much more.

http://musicbrainz.org/doc/Advanced_Relationship
http://wiki.musicbrainz.org/Release
http://wiki.musicbrainz.org/Release_Format
http://wiki.musicbrainz.org/Release_Event

http://musicbrainz.org/doc/MusicBrainz_Database

I read most of this stuff about their NGS on MusicBrainz blog.

http://blog.musicbrainz.org/

Bonus with reusing the MusicBrainz's NGS is that their professionals will keep that database schema up to date and fix bugs in it Smile

http://tickets.musicbrainz.org/browse/MBS
http://bugs.musicbrainz.org


Sorry if this was a dumb suggestion or if you are only looking at creating a database schema for handling videos. I am most curious if you will consider music and other pure audio files when you implement your new database schema for XBMC, so wondering what the status on this is today?
Reply

Logout Mark Read Team Forum Stats Members Help
[RFC] Database Schema 4.0 - Request for Comments (Developers and DB Admins only!)1