How I intend to handle a library with 10,000 games
#1
A feature I'm currently working on is content add-ons. These are similar to Python plugins, but content is scanned into the library instead of browsed through the add-ons node.

A quick perusal of the Internet Archive ROM Launcher plugin yields a delightful 10,390 games. If the "launching" part is separated from the "browsing" part, this plugin could easily become a content add-on that provides an instant game library, 10,000 games strong.

Unfortunately, the heavy torrenters out there know how sluggish the library can be with multiple thousands of items. Clearly a new approach is needed.

Enter the unified content database. This database has been built to handle all library content - movies, TV shows, music, and now games as well. It can scale to tens of thousands of items and can handle binary large objects (BLOBs), so it may someday be able to synchronize fanart and game saves.

The existing libraries are built on SQL databases. The relational model offers several advantages; it is suitable for much of the data stored in the library, and simplifies development by providing strong ACID guarantees. However, when there is a large amount of highly-relational data, JOIN operations across large tables become prohibitively expensive.

The unified content database is backed by NoSQL solutions. All of a sudden, fetching an item, even one with a heavy amount of 1:N and N:M relations, becomes O(1). This fact is exploited to give incredibly fast read times, even as the library scales in size. Additionally, NoSQL is schema-less, so painful database updates are no longer required.
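To illustrate the read-side win described above, here is a minimal sketch (hypothetical data and field names, not the actual schema) of why a denormalized document makes fetching an item with many relations a single key lookup:

```python
# Hypothetical sketch: in a document store, a game and all of its
# 1:N / N:M relations (genres, publishers, artwork) live in ONE
# denormalized record, so a fetch is a single key lookup - no JOINs.
games = {
    "game/1": {
        "title": "Super Metroid",
        "platform": "SNES",
        "genres": ["Action", "Adventure"],   # 1:N relation, embedded
        "publishers": ["Nintendo"],          # N:M relation, embedded
        "artwork": {"fanart": "fanart1.jpg"},
    },
}

def fetch(key):
    """O(1) read: one lookup returns the item with all its relations."""
    return games[key]

item = fetch("game/1")
print(item["title"], item["genres"])
```

In a relational schema the same read would touch a games table plus genre, publisher, and art link tables; here the cost of assembling the record was paid once, at write time.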

Unfortunately, lunch ain't free. The burden is shifted to more complex write operations. This is currently where most of my development is focused. Dropping ACID guarantees has been great for read speed, but those four little letters were actually quite helpful. So now I'm working on restoring the advantages that were lost in moving away from SQL.

When this is closer to being merged with RetroPlayer, I'll return with a more detailed explanation of the internals.

Currently, I've integrated two key-value stores appropriate for embedded use, LevelDB and Kyoto Cabinet. If there's enough interest, backend support could be extended to binary add-ons. That way, thin database clients could be created similar to PVR clients. They would communicate with a network backend, such as CouchDB or MongoDB, allowing for a synchronized database in the same manner as MySQL.

There's another feature I'm working on: unified content scraping. The idea is that the content database exposes a tree-like structure that can be browsed through a content:// url on the VFS. This expands to all content add-ons, as well as all file sources.

Initially, the leaves of this tree are simple URLs with little or no metadata/artwork. That's where Heimdall comes in (remember the GSoC 2012 project by topfs2?). I simply point Heimdall at the root of this tree and let it do its work. Heimdall runs in the background, "pruning" each leaf, filling in all the missing metadata fields/artwork. Under the hood, Heimdall is a simple inference engine using a set of Python rules to extract metadata. It builds on previous information, so any info provided by a content add-on (such as title, platform, etc.) or scraped from the file (see PyRomInfo) is used to inform Heimdall's discovery process.

Finally, I come to the inevitable game library. The game library is defined by XML nodes similar to video library nodes. Library data is pulled directly from the unified content database.

Heimdall works in a data-driven manner, meaning that it processes leaves whenever data becomes available. For example, a PyRomInfo rule might discover a unique game ID embedded in the ROM, and Heimdall could then use that unique ID to perform a 1:1 match against a remote database without risking string-matching the filename against a set of similar titles. This is really cool, because PyRomInfo scraping is hundreds of times faster than pinging a remote web server, so Heimdall could decide to process ALL files with a PyRomInfo pass and defer the more expensive web lookups for later. When scraping local roms, this means that a rudimentary library is built within seconds, even for thousands of roms, and slowly embellished as more data becomes available.
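The two-pass idea above can be sketched roughly like this (all function names and the fake ID scheme are hypothetical stand-ins, not Heimdall's or PyRomInfo's actual APIs):

```python
# Hypothetical sketch of data-driven scraping: run the cheap local
# pass (PyRomInfo-style header parsing) over every file first, and
# queue the expensive remote lookups to run later in the background.
from collections import deque

def local_pass(rom):
    # Cheap: pretend we extract a unique game ID from the ROM header.
    # (Deterministic toy ID; a real scraper reads the file contents.)
    return {"path": rom, "game_id": sum(map(ord, rom)) % 100000}

def remote_lookup(entry):
    # Expensive: a 1:1 match against a remote database by unique ID
    # (stubbed out here) fills in the remaining metadata.
    entry["title"] = "Title for id %d" % entry["game_id"]
    return entry

roms = ["a.sfc", "b.sfc", "c.sfc"]
library = [local_pass(r) for r in roms]  # rudimentary library, built fast
deferred = deque(library)                # embellished as data arrives
while deferred:
    remote_lookup(deferred.popleft())
```

The library is browsable as soon as the first pass finishes; each deferred lookup only enriches entries that already exist.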
RetroPlayer releases: https://github.com/garbear/xbmc/releases

Donations: eigendude.eth
#2
Sounds good to me. Couple of things:

- Will it work in the same way as other media types (scraping etc.)?
- If not, will it be scalable in the future to other media types?
- Any new dependencies needed in this approach?
- Any disadvantages? (currently it's nice to be able to open the database with sqlite viewer)
- Will this get merged separately at some point so others can test the code in the mainline?
#3
(2015-09-09, 15:53)zag Wrote: - Will it work in the same way as other media types (scraping etc.)?
- If not, will it be scalable in the future to other media types?

Scraping is all done in python. Heimdall supports music and video, so we can drop our XML scrapers whenever we're ready

(2015-09-09, 15:53)zag Wrote: - Any new dependencies needed in this approach?

Requires a NoSQL backend. wsnipex helped me integrate Kyoto Cabinet and LevelDB into unified depends, so we currently have two choices.

(2015-09-09, 15:53)zag Wrote: - Any disadvantages? (currently it's nice to be able to open the database with sqlite viewer)

No more sqlite viewer. It's just JSON (actually BSON) in a key-value store.

(2015-09-09, 15:53)zag Wrote: - Will this get merged separately at some point so others can test the code in the mainline?

I'll carry it in my RetroPlayer branch. The patch is independent, so it could be merged separately (though a game library isn't much use without a player)

To be clear, although this is a unified content database, I only intend to implement games. Video and music can come over when they're good and ready
#4
Have you wrapped the database access completely?

We'll probably need a way to replace triggers - I haven't really checked what they're used for now, but I know we have some. So I think a wrapper is needed, which would also let us set modified date stamps.
#5
So as it currently stands, the game database will never work with a MySQL backend for synchronisation; it will be local-only until someone writes a database backend for it?
#6
I'm adding MongoDB support. It'll be networked if you have a MongoDB server
#7
If possible, add support for MySQL, as Kodi already uses it for videos/music, and that way a headless server could serve games too. Not sure what the future of MySQL in Kodi is anyway.
#8
No, the unified content database won't support MySQL. You'll need a NoSQL database. I'm adding drivers for MongoDB, LevelDB and Kyoto Cabinet
#9
I'd focus on just one implementation - other NoSQL drivers can be worked on after the feature is merged
#10
Not all NoSQL databases are created equal. There are two types of NoSQL databases that fit our problem: key-value stores and document stores. LevelDB and Kyoto Cabinet are key-value stores. MongoDB is a document store.

Key-value stores have a simple API - Put(), Get() and Delete(). Document stores are built on key-value stores, but they have one additional method - GetByProperty()
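As a toy illustration, the three-method API amounts to something like this (an in-memory Python stand-in, not LevelDB's or Kyoto Cabinet's actual bindings):

```python
# Hypothetical in-memory stand-in for a key-value store, showing the
# whole API surface described above: Put(), Get() and Delete().
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def Put(self, key, value):
        self._data[key] = value

    def Get(self, key):
        # Returns None for missing keys, mirroring a failed lookup.
        return self._data.get(key)

    def Delete(self, key):
        self._data.pop(key, None)

kv = KeyValueStore()
kv.Put("game/1", {"title": "Sonic"})
print(kv.Get("game/1"))
kv.Delete("game/1")
```

Note there is no way to ask "give me every value where platform == SNES" - only exact keys can be looked up, which is exactly the limitation discussed next.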

The difference is, with a key-value store, you get one library node - "All Games", because you can't filter by property. With document stores, you get a library with "Platforms", "Publishers", "Genres" nodes, etc, because you can filter by these properties.

So which implementation do we go with? Clearly, GetByProperty() is needed. The difference between Get() and GetByProperty() is a lot of complexity. This complexity needs to go somewhere.

MongoDB is the obvious choice. It removes all the complexity from Kodi. It can do GetByProperty() and a bunch of other stuff. In fact, it's so complex that you can't embed it. What if Kodi dropped SQLite, and required everyone to use MySQL?

I'm taking the same approach I did with libretro. Same situation - libretro is a massive API, written in C, so you get complexity from the size AND from the callbacks-of-callbacks nightmare required for C. Recall my solution:

[Image: diagram showing the Game add-on, the libretro wrapper, and the libretro add-on layers]

The purpose of the libretro wrapper is to encapsulate all the complexity, so that Kodi only has to deal with a clean, simple Game API.

Likewise, I'm writing a wrapper that implements GetByProperty() on top of a key-value store. This means that we get all the benefits of an embeddable key-value store, and all the complexity stays hidden from Kodi.
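A rough sketch of how such a wrapper might implement GetByProperty() over a plain key-value store, by maintaining secondary indexes on every write (all names are hypothetical; properties are kept scalar-valued here so they can be index keys):

```python
# Hypothetical sketch: a document-store wrapper over a key-value
# store. Secondary indexes (property -> value -> set of keys) are
# updated on each Put/Delete, so GetByProperty() avoids a full scan.
from collections import defaultdict

class DocumentStore:
    def __init__(self):
        self._kv = {}                                        # primary store
        self._index = defaultdict(lambda: defaultdict(set))  # secondary

    def Put(self, key, doc):
        self.Delete(key)  # drop stale index entries on update
        self._kv[key] = doc
        for prop, value in doc.items():
            self._index[prop][value].add(key)

    def Get(self, key):
        return self._kv.get(key)

    def Delete(self, key):
        old = self._kv.pop(key, None)
        if old:
            for prop, value in old.items():
                self._index[prop][value].discard(key)

    def GetByProperty(self, prop, value):
        # This is the extra method that turns "All Games" into
        # browsable "Platforms"/"Publishers"/"Genres" nodes.
        return [self._kv[k] for k in self._index[prop][value]]

store = DocumentStore()
store.Put("game/1", {"title": "Sonic", "platform": "Genesis"})
store.Put("game/2", {"title": "Mario", "platform": "SNES"})
print(store.GetByProperty("platform", "SNES"))
```

The cost shows up exactly where the earlier post said it would: writes now touch the primary store plus one index per property, while reads by property stay fast.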

Naturally, I'm taking the test-driven approach. I'm implementing MongoDB as a document store. Then I'm writing test cases. Then, I replace the document store with a key-value store and my custom wrapper. The end goal looks exactly like the picture above: replace "Game add-on" with "Document store", "Libretro add-on" with "Key-value store", and "Libretro wrapper" with GetByProperty()
#11
Does this mean that the effort to design a new DB structure worked on by m.savazzi and others will be abandoned?

See http://forum.kodi.tv/showthread.php?tid=200911

Maybe some of his research from that thread and gathered elsewhere could be reused anyway?

(2015-09-09, 15:53)zag Wrote: will it be scalable in the future to other media types
+1 for also considering adding a pictures library, as per http://forum.kodi.tv/showthread.php?tid=141169
#12
(2015-09-10, 15:50)RockerC Wrote: Does this mean that the effort to design a new DB structure worked on by m.savazzi and others will be abandoned?
Well, with a NoSQL store you don't need the normalised schema that m.savazzi created, but the work is not lost IMO, as this normalised schema can just as well be used as a template for objects sharing the same interface and inheriting from one another; they are just not stored in a SQL store but a NoSQL one. Also, the schema gives a very good overview of which parts could probably be implemented as microservices.

@garbear
I know what you mean, but I'm not sure an interface is enough. You might also need to make use of some sort of QOM that each client translates into its native format. So using a persistence interface in combination with a QOM would IMO be the way to go. It would even give the possibility to write a MySQL client for it or bind it to a remote webservice.

Code:
----- Pseudo code ---------------
storage = PersistenceLayer::Get( )->getStorageForObject('movies');
storage->add(randomNewMovie);
storage->update(existingMovie);
storage->delete(existingMovie);

query = storage->createQuery();
query->setLimit(20)
     ->setOrderings('{dateAdded:ASC, name:DESC}')
     ->matching(
        query->logicalAnd(
            query->equals('genre', 'Action'),
            query->isGreaterThan('releaseDate', '20130101'),
            query->logicalOr(
                query->equals('watched', FALSE),
                query->equals('progress', 0)
            )
        )
     );

resultStorage = query->execute();
#13
(2015-09-10, 15:50)RockerC Wrote: Does this mean that the effort to design a new DB structure worked on by m.savazzi and others will be abandoned?

by no means should they slow down their progress. this nosql work is largely experimental and may not be merged. let's wait for some speed tests first.

my hypothesis is that sql scales poorly on constrained devices like the RPi. I think we haven't seen this because our movie and tv show libraries haven't gotten that big yet.

sure, you might be addicted to torrenting like crack. but your libraries are never gonna grow to contain 20,000 items. Under the assumptions of this crazy formula, if the average video file runs around 40 minutes, and the quality is slightly better than the typical cam, you're looking at 14 TERABYTES of video.

games and pictures are different for two reasons. i'm creating scannable game plugins, so you can scrape the entire internet archive (10,300 titles) at once. and picture libraries can number in the hundreds of thousands. in both cases, there is a whole lot more data being dealt with.

ofc, the RPi could be constrained by the memory needed to display 10,000 items, and the database technology could make no difference. once I gather some numbers, we'll know for sure
#14
(2015-09-10, 16:20)da-anda Wrote: @garbear
I know what you mean, but I'm not sure a interface is enough. You might also need to make use of some sort of QOM that each client translates into it's native format. So using a persistence interface in combination with QOM would IMO be the way to go. It would even give the possibility to write a MySQL client for it or bind it to a remote webservice.

you're right, an interface isn't enough. not sure what QOM stands for, but i was going to write a query method based on our CVariant class in C++. Anything other than the data's native format will probably be too slow.

(2015-09-10, 16:20)da-anda Wrote:
Code:
----- Pseudo code ---------------
storage = PersistenceLayer::Get( )->getStorageForObject('movies');
storage->add(randomNewMovie);
storage->update(existingMovie);
storage->delete(existingMovie);

query = storage->createQuery();
query->setLimit(20)
     ->setOrderings('{dateAdded:ASC, name:DESC}')
     ->matching(
        query->logicalAnd(
            query->equals('genre', 'Action'),
            query->isGreaterThan('releaseDate', '20130101'),
            query->logicalOr(
                query->equals('watched', FALSE),
                query->equals('progress', 0)
            )
        )
     );

resultStorage = query->execute();

Can you post some more example queries that we would see from a typical smart playlist?
#15
(2015-09-10, 16:28)garbear Wrote: my hypothesis is that sql scales poorly on constrained devices like the RPi. I think we haven't seen this because our movie and tv show libraries haven't gotten that big yet.

sure, you might be addicted to torrenting like crack. but your libraries are never gonna grow to contain 20,000 items.

A music library with 20k tracks is actually quite common, and I've seen reports from users with *far* larger music libraries.
Performance is very bad (e.g. a minute to play a song), and not just on the Pi; I've seen similar complaints on PCs.

However, I'm not sure this is all down to SQLite. I think depending on the view used and the way the file is played, Kodi may create a playlist of every song in the current node (which is the whole library for the tracks node), which takes forever for 20k tracks.

But if you are considering SQLite performance for large libraries, then music libraries have the potential to be much larger than TV and movies.
If your database solution can improve music library performance, then that would be valuable.
