[LINUX] bash script for finding not-scraped movies
#1
Hi there,

I have often seen this problem raised in this forum and needed a solution myself, so I wrote a small shell script that tries to find media files that are not yet in the library. The script is ugly and dirty, but it works, at least for me.

The script does not change the database at all! It only queries the database and doesn't modify anything in it. The only files that are written are the ones described below.

This is work in progress. I'll try to improve it further and will also try to include some more specific database queries. If you have any suggestions, let me know.

Settings
Within the settings there are the following variables that you are urged to look at and change to values that fit your environment:
  • DBPATH points to the video-database (the file itself, not just the directory)
  • PREFIX will set a prefix that is used for all files that are created during the runtime of the script. It may include an absolute or relative path and a prefix of the filename. A good idea for this setting is something like the homefolder and a file-prefix or a temporary folder. Any given folder must already exist.
All required programs should normally be installed already. If not, install the corresponding packages and adjust the command entry if needed. The command most likely to be missing is sqlite3. The script might also work with older versions of the sqlite command-line tool, but don't rely on this: the database that XBMC creates and uses is in version 3 format, and a short test with sqlite 2.8.17 resulted in an error while reading the database. So only change this value if the sqlite tool you point it at has an appropriate version.
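A minimal sketch of how the sqlite3 requirement above could be checked before running the script. The helper name is_sqlite3_version is hypothetical and not part of the script; the check only looks at the leading version number that `sqlite3 -version` prints.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script):
# succeeds if the given version string belongs to the 3.x series.
is_sqlite3_version() {
    case "$1" in
        3.*) return 0 ;;
        *)   return 1 ;;
    esac
}

SQLITECMD="sqlite3"   # same variable name as in the script's settings

if command -v "$SQLITECMD" >/dev/null 2>&1; then
    # "sqlite3 -version" prints the version number first, followed by build details.
    version=$("$SQLITECMD" -version | cut -d' ' -f1)
    if is_sqlite3_version "$version"; then
        echo "$SQLITECMD $version looks usable"
    else
        echo "$SQLITECMD reports version $version; a 3.x tool is needed" >&2
    fi
else
    echo "$SQLITECMD not found; install the sqlite3 package" >&2
fi
```

If the check fails, either install the package or point SQLITECMD at the full path of a 3.x binary.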

Limitations
This script does not check whether a given entry in the library contains useful information. There might be entries that only contain the filename and a bookmark, for example. These entries should also be considered "missing in the library", but it is quite difficult to distinguish between items that were scraped without returning useful information and items that got no result at all and are in the library for a different reason. For the conditions under which an entry is added to the library, consult the documentation.

Results
The script produces three files (each preceded by the prefix, if set):
  • db-only.lst
    This file contains entries that are in the library only but not in the filesystem. If you want to get rid of them, usually a library clean-up (within XBMC) should do the job
  • fs-only.lst
    This is the most relevant file, as the files listed in it don't exist in the library but exist in the file system. There may be different reasons why they are not (yet) scraped. But at least now you know which files your library is missing
  • db-stacked.lst
    This file only shows which entries in the library are stacked entries, i.e. entries that are treated as a single media item but consist of multiple files.
Four more files are written while the script runs and are deleted afterwards. They are also affected by the PREFIX setting.
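To illustrate what db-only.lst and fs-only.lst contain, here is a small sketch that compares two sorted lists. It is not the script's own method: it uses comm instead of the diff pipeline, and the sample paths are made up.

```shell
#!/bin/sh
# Sketch only: sample paths are invented, and comm stands in for the script's diff pipeline.
LC_ALL=C; export LC_ALL
workdir=$(mktemp -d)

# Pretend result of scanning the filesystem.
printf '%s\n' /movies/a.avi /movies/b.mkv | sort > "$workdir/find.lst"
# Pretend result of querying the library database.
printf '%s\n' /movies/b.mkv /movies/c.avi | sort > "$workdir/db_files.lst"

# Lines only in the first file: on disk but not in the library.
comm -23 "$workdir/find.lst" "$workdir/db_files.lst" > "$workdir/fs-only.lst"
# Lines only in the second file: in the library but not on disk.
comm -13 "$workdir/find.lst" "$workdir/db_files.lst" > "$workdir/db-only.lst"

echo "fs-only: $(cat "$workdir/fs-only.lst")"
echo "db-only: $(cat "$workdir/db-only.lst")"
```

Here a.avi ends up in fs-only.lst (a candidate for scraping) and c.avi in db-only.lst (a candidate for a library clean-up).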

Execution
All you need to do is copy the code to an empty textfile, adjust the settings to your needs, save this file and run it with the following command
Code:
sh <filename>
You can also set the executable bit on the file and run it directly, whichever you prefer.

Here we go:
[EDIT: the script is now available at the wiki only:
http://wiki.xbmc.org/?title=Linux-Script...ped_Movies ]
Reply
#2
Cool script.

I think that since most of us store our media in some kind of NAS, your script should be adapted to locate content within smb:// shares.

just my 2c.
Reply
#3
Duduke Wrote:Cool script.

I think that since most of us store our media in some kind of NAS, your script should be adapted to locate content within smb:// shares.

just my 2c.
On Linux you mount stuff "locally", a bit like a network drive in Windows. So the script should be able to read stuff from your NAS with no issues at all.
Reply
#4
Thanks for the script. Very useful. I'll try it.
Reply
#5
I moved the script and a better description to the wiki.
Reply
#6
Duduke Wrote:Cool script.

I think that since most of us store our media in some kind of NAS, your script should be adapted to locate content within smb:// shares.

just my 2c.
You're right to some extent. Because the path to such an item would be stored as an smb:// path (just guessing, as I don't use Samba shares myself), my script cannot resolve it on the command line. But integrating Samba shares into this script would be quite tricky, and at the moment I don't plan to add anything like this.

If there are items in the library that are accessed through a special protocol like smb (or ftp), you are likely to see them in the file db-only.lst, as they are stored in the library but not found in the filesystem. Please tell me what such an entry looks like and I can see whether that case can be handled at all.

A better approach is to actually mount the network shares on your XBMC machine and access the contents as if they were stored locally. This works for classic NFS shares as well as for SMB or even FTP, and it has a lot of advantages and, to my knowledge, no downsides.
Reply
#7
I'm experiencing problems with scraping at the moment and noticed your script here. Just wanted to add that it is possible, and relatively easy, to highlight unscraped movies via the built-in XBMC debugger.

1) Start XBMC
2) Enable debugging output System>Settings>Debug>Enable
3) Force a rescan of your directories
4) Exit XBMC and analyse the log file.

Example of one movie not being found in tmdb for me at the moment:
Quote:cocjh1@xbmc:~$ grep InternalFindMovie .xbmc/temp/xbmc.log
23:08:48 T:3059698544 M:1881116672 DEBUG: InternalFindMovie: Searching for 'she's all that' using themoviedb.org scraper (file: 'tmdb.xml', content: 'movies', language: 'en', date: '2009-11-11', framework: '1.1')

I simply grepped for the keyword InternalFindMovie above, but alternatively you can open the log file in your favorite text viewer, search for <results></result> or GetSearchResults, and read backward in the log...
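As a sketch of the grep approach above, the searched titles can be pulled out of the log with sed. The first sample line is the one quoted above; the second is made-up filler. Note that this only lists titles the scraper searched for; whether a search returned nothing still has to be checked in the surrounding log lines.

```shell
#!/bin/sh
# Sketch: build a tiny sample log so the example runs without XBMC.
workdir=$(mktemp -d)
cat > "$workdir/xbmc.log" <<'EOF'
23:08:48 T:3059698544 M:1881116672 DEBUG: InternalFindMovie: Searching for 'she's all that' using themoviedb.org scraper (file: 'tmdb.xml', content: 'movies', language: 'en', date: '2009-11-11', framework: '1.1')
23:08:49 T:3059698544 M:1881116672 DEBUG: some unrelated line (invented for this example)
EOF

# Keep only the text between "Searching for '" and "' using", one title per line.
titles=$(sed -n "s/.*InternalFindMovie: Searching for '\(.*\)' using .*/\1/p" "$workdir/xbmc.log" | sort -u)
echo "$titles"
```

On a real system the input would be the log file shown in the quote, .xbmc/temp/xbmc.log, instead of the sample file.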
Reply
#8
Thanks for this hint. Of course there are several ways to do the job, and grepping the log file for some keywords is one of them. I will check whether the proposed keywords are reliable and lead to an exact result. The downside of your approach is that one first has to put XBMC into debug mode and then scan the media files. It is therefore not that usable for people who just want to check from time to time whether media items are missing from the library.

I know that my script is far from perfect, or even universal. Therefore I'm actually thinking about writing a (Python) script that will run many more tests and might lead to better results.
Reply
#9
All,
I updated the script to work on a Mac. I added a -d option to specify the database location, and a -o option so the user can specify the output directory.
As for small bug fixes, I quoted most of the variables used, as file names with spaces would break the script.
Can I (or someone else) update the wiki with this new code?

-- Deathinator

Code:
#!/bin/bash
#
# XBMC Orphans and Widows
# v1
#
# modified 2009-12-10 deathinator
# added cmd line options -d,-o and ported to MacOS
#
# created by BaerMan for XBMC-community
# This script may be used for any purposes.
# You may change, sell, print or even sing it
# but you have to use it at your own risk!
#
# This script is ugly and may under certain circumstances crash your
# computer, kill your cat and/or drink your beer.
# Use it at your own risk!
#
# This script searches for media files (actually video files only) and
# checks for
# 1) files that are not in the library
# 2) files that are in library only
# 3) entries in the library that are 'stacked' ones

################
### Functions ##
################
function usage
{
    echo Usage: "${0##*/} -d DBPATH  -o OUT_DIR"
    echo "See: http://wiki.xbmc.org/?title=Linux-Script_To_Find_Not_Scraped_Movies"
    echo "for more details"
}

function getDbPath
{
    system=`uname`
    # Quote the variable so the test does not break on an empty value
    if [ "$system" = "Linux" ]; then
        echo "$HOME/.xbmc/userdata/Database/MyVideos34.db"
    elif [ "$system" = "Darwin" ]; then
        # mac
        echo "$HOME/Library/Application Support/XBMC/userdata/Database/MyVideos34.db"
    fi
}

################
###  Defaults  #
################
### Full path to the video-database ; may be absolute (preceded by a
### slash "/") or relative from the current directory
DBPATH=`getDbPath`
OUTDIR="$HOME"


################
###  Args    ###
################
while getopts "vd:o:h" OPTION
do
    case $OPTION in
        h) usage
           exit 1
           ;;
        d) DBPATH="$OPTARG"
           ;;
        o) OUTDIR="$OPTARG"
           # put files here
           ;;
    esac
done

################
### Settings ###
################

### Filenames for results and intermediate data
### You may change these to any name and place you like but beware not to
### overwrite or delete files you may still need
PREFIX="$OUTDIR/xbmc_"
DBPATHLIST="${PREFIX}db_path.lst"
DBFILESLIST="${PREFIX}db_files.lst"
FINDLIST="${PREFIX}find.lst"
DIFFLIST="${PREFIX}diff.lst"
DBONLYLIST="${PREFIX}db-only.lst"
FSONLYLIST="${PREFIX}fs-only.lst"
STACKEDLIST="${PREFIX}db-stacked.lst"

### Programs used ; either absolute path or command only if path to the
### binary is in variable $PATH ; each command may be extended by optional
### arguments - refer to the specific manpage for details
SQLITECMD="sqlite3" ; FINDCMD="find" ; SORTCMD="sort"
GREPCMD="grep" ; RMCMD="rm" ; UNIQCMD="uniq"
DIFFCMD="diff -a -b -B -U 0 -d --suppress-common-lines"
SEDCMD="sed"

#######################################
### Changes within the working code ###
#######################################

### There is a list of suffixes, that we will search for. You may add,
### delete or modify any entry to fit your needs, but respect the
### correct escaping of newlines

### We don't want to descend into subdirectories as they are usually
### represented by their own path-entry in the database. Deep scans would
### lead to multiple hits on the same file. But if for some reason not all
### path elements are represented in the database, you may find and delete
### the following string and force $FINDCMD to look into all subdirectories
### in any given path
### "-maxdepth 1"

####################
### working code ###
####################
${RMCMD} -f  ${PREFIX}*.1st

if [ ! -s  "$DBPATH" ] ; then
    echo "DBPATH: $DBPATH is not found or empty" >&2
    exit 1
fi

${SQLITECMD} -list -separator '' "${DBPATH}" \
    "select strPath from path order by strPath;" | ${GREPCMD} -vE 'http://|rtmp://' \
    | ${SORTCMD} > "${DBPATHLIST}"

if [ ! -s "${DBPATHLIST}" ] ; then
    echo "Error running $SQLITECMD" >&2
    echo ${SQLITECMD} -list -separator '' ${DBPATH} \
    "select strPath from path order by strPath;" \
    \| ${SORTCMD} \> ${DBPATHLIST}
    exit 1
fi

${SQLITECMD} -list -separator '' "${DBPATH}" \
    "select strPath, strFilename from path, files where path.idPath = files.idPath order by strPath, strFilename;" \
    | ${SORTCMD} | ${GREPCMD} -vE 'http://|rtmp://' > "${DBFILESLIST}"

if [ ! -s "$DBFILESLIST" ] ; then
    echo "Error running $SQLITECMD" >&2
    echo ${SQLITECMD} -list -separator '' ${DBPATH} \
    "select strPath, strFilename from path, files where path.idPath = files.idPath order by strPath, strFilename;" \
    \| ${SORTCMD} \> ${DBFILESLIST}
    exit 2
fi

IFS='
'
for fPATH in $(<${DBPATHLIST}) ; do
#    echo Searching: "${fPATH}"
    $FINDCMD "${fPATH}" -maxdepth 1 \
    -name '*.avi'  -o \
    -name '*.divx' -o \
    -name '*.m2v'  -o \
    -name '*.mkv'  -o \
    -name '*.mp4'  -o \
    -name '*.mpeg' -o \
    -name '*.mpg'  -o \
    -name '*.ogm'  -o \
    -name '*.vob'  -o \
    -name '*.iso'  \
    2> /dev/null | ${SORTCMD} | ${SEDCMD} 's|//|/|g'   >> ${FINDLIST}
done
unset IFS
# File names are quoted from here on: IFS is back to its default, so an
# OUTDIR containing spaces would otherwise be split into several words.
${DIFFCMD} "${FINDLIST}" "${DBFILESLIST}" | ${GREPCMD} -v "^@@" | ${GREPCMD} -v '[+-]\{3\}' | ${SORTCMD} -k 1.2 | ${UNIQCMD} -s 1 > "${DIFFLIST}"
${GREPCMD} '^+' < "${DIFFLIST}" | ${GREPCMD} -v 'stack:///' | ${GREPCMD} -v 'http://' | ${GREPCMD} -v '^+/$' > "${DBONLYLIST}"
${GREPCMD} '^-' < "${DIFFLIST}" > "${FSONLYLIST}"
${GREPCMD} "stack:///" < "${DIFFLIST}" > "${STACKEDLIST}"
${RMCMD} "${DBPATHLIST}" "${DBFILESLIST}" "${FINDLIST}" "${DIFFLIST}" 2>/dev/null
echo "Results located here: " "${OUTDIR}"/*.lst
Reply
#10
Thanks for your help on this script. I really appreciate your support and will test the modified script a.s.a.p.

deathinator Wrote:I updated the script to work on a mac. I added a -d option for specify the database location. Also, I added the -o so the user can specify the output directory

My intention was to present a simple script that does a simple job. I also considered using command-line arguments, but I thought they would bloat the script and make it less comfortable to use. Moreover, I assume that everyone who wants to use this script modifies the variables to suit their needs (if modification is needed at all), then runs the script whenever they like without caring about the settings anymore. For example, no one would run the script once on a Linux system and afterwards on a Mac. "I could be wrong now, but I don't think so"

The location of the database doesn't change on a running installation and if it does, the user can edit the script again. Regarding the output location it could be useful to temporarily put the files to another place, but I guess it is reasonable to use the standard target and then manually move the files to the desired place.

Quote:As for some small bug fixes, I quoted most of the variable used, as files names with spaces would bomb the script
Handling filenames that contain spaces was one of my little problems. For some reason most of the usual tricks didn't work. But you are right, I didn't consider the handling of the settings (especially PREFIX).
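One common way to cope with spaces in paths is to read the path list line by line and quote every expansion. The following is only a sketch of that idea, not the loop the script uses (the script sets IFS to a newline and uses a for loop); the directory and file names are made up.

```shell
#!/bin/sh
# Sketch: invented directories and files, created in a scratch directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/My Movies" "$workdir/Other"
touch "$workdir/My Movies/some film.avi" "$workdir/Other/clip.mkv"

# Stand-in for the list of library paths queried from the database.
printf '%s\n' "$workdir/My Movies" "$workdir/Other" > "$workdir/db_path.lst"

# "IFS= read -r" keeps each line intact, including spaces and backslashes.
while IFS= read -r fPATH; do
    find "$fPATH" -maxdepth 1 \( -name '*.avi' -o -name '*.mkv' \)
done < "$workdir/db_path.lst" | sort > "$workdir/find.lst"

cat "$workdir/find.lst"
```

Because the output file name is quoted as well, the same pattern also works when the prefix or output directory contains spaces.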

Sorry, but some of your other changes don't make much sense:

You replaced the explicit list of files to be removed with a wildcard match: ${RMCMD} -f ${PREFIX}*.1st. This is quite a bad idea, as you don't know which files will be matched. I am using several other scripts where PREFIX is set to the same value, and all of their results would be removed. As a rule of thumb, you should never delete files that you didn't create. Declaring them in the script isn't that hard, and once it's done there is no need to think about it anymore.
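A sketch of the rule of thumb above: the clean-up removes only the files declared in the script and leaves anything else sharing the prefix alone. The variable names follow the posted script; the cleanup function and the extra file are hypothetical.

```shell
#!/bin/sh
# Sketch: a scratch directory stands in for the real output directory.
OUTDIR=$(mktemp -d)
PREFIX="$OUTDIR/xbmc_"
DBPATHLIST="${PREFIX}db_path.lst"
DBFILESLIST="${PREFIX}db_files.lst"
FINDLIST="${PREFIX}find.lst"
DIFFLIST="${PREFIX}diff.lst"

# Hypothetical helper: deletes exactly the intermediate files named above.
cleanup() {
    rm -f -- "$DBPATHLIST" "$DBFILESLIST" "$FINDLIST" "$DIFFLIST"
}

# A file from some other script that happens to share the prefix (invented).
touch "${PREFIX}other-script.lst"
# The intermediate files this script would create.
touch "$DBPATHLIST" "$DBFILESLIST" "$FINDLIST" "$DIFFLIST"

cleanup
ls "$OUTDIR"
```

The function could also be registered with `trap cleanup EXIT` so the intermediate files are removed even when the script stops early.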

You moved the filtering of special paths (http, rtmp) into the sqlite command and then also filtered http and stack when creating the db-only list. My reason for putting this into one single command ("grep -v ://") was to match all special paths (as they all contain this string), and my reason for not filtering directly when querying the database was to be able to see ALL paths in the output file. That way one can disable the deletion of the files at the end of the script and inspect all intermediate output files with all results in them.
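A sketch of the single-filter idea: the intermediate list keeps every path, and one `grep -v '://'` applied at the end drops all protocol-style entries at once. The sample entries, including the form of the stack entry, are illustrative only.

```shell
#!/bin/sh
# Sketch: invented sample entries, written to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/db_files.lst" <<'EOF'
/movies/a.avi
smb://nas/videos/b.mkv
http://example.invalid/trailer.mp4
stack:///movies/c-cd1.avi , /movies/c-cd2.avi
EOF

# The unfiltered list stays on disk for inspection; only the result is filtered.
grep -v '://' "$workdir/db_files.lst" > "$workdir/local-only.lst"
cat "$workdir/local-only.lst"
```

Only /movies/a.avi survives the filter, while db_files.lst still shows all four entries.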

Quote:Can I/someone update the wiki with this new set of code?
Within the wiki you can make changes yourself, but I'd appreciate it if I could keep managing the script itself. As we can see, opinions might differ. Discussion is welcome, of course, in this thread as well as in the wiki, but preferably we should keep it in the forum.
Nevertheless, you could add another wiki page, put your script on it, and link to it.
Reply
#11
@deathinator: I must confess I made a mistake when discussing the changes you made. I compared your script with the script I actually use at home, which is already a (slightly) improved version of the published one.

I'll update the script in the wiki to reflect the latest changes, also including some of the ones made by deathinator.

EDIT: the wiki is updated now
Reply
#12
deathinator Wrote:
Code:
[...]
for fPATH in $(<${DBPATHLIST}) ; do
#    echo Searching: "${fPATH}"
    $FINDCMD "${fPATH}" -maxdepth 1 \
[...]
    -name '*.iso'  \
    2> /dev/null | ${SORTCMD} | [color=red]${SEDCMD} 's|//|/|g'[/color]   >> ${FINDLIST}
done
${GREPCMD} ^+ < ${DIFFLIST} | [color=red]${GREPCMD} -v 'stack:///' | ${GREPCMD} -v 'http://'[/color] | ${GREPCMD} -v '^+/$' > ${DBONLYLIST}
[color=red]${GREPCMD} "stack:///"[/color] < ${DIFFLIST} > ${STACKEDLIST}
[...]
How do you expect the grep commands to match anything when, a few lines earlier, all multiple slashes are merged into single ones? Am I overlooking something?
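The effect in question can be reproduced in isolation. This sketch only shows what the sed expression does to a line that passes through it; the sample paths are made up. In the posted code the sed step is applied to the find output, while the database lists are written without it, so whether the later greps still match depends on which list a line came from.

```shell
#!/bin/sh
# Sketch: the same sed expression as in the quoted code, wrapped in a helper.
collapse() { sed 's|//|/|g'; }

# Invented sample paths.
plain=$(printf '%s\n' '/movies//film.avi' | collapse)
stacked=$(printf '%s\n' 'stack:///movies/film-cd1.avi' | collapse)
web=$(printf '%s\n' 'http://example.invalid/x.mp4' | collapse)

echo "$plain"
echo "$stacked"
echo "$web"

# After collapsing, a pattern with three slashes no longer matches.
if printf '%s\n' "$stacked" | grep -q 'stack:///'; then
    echo "stack:/// still matches"
else
    echo "stack:/// no longer matches"
fi
```

The plain path becomes /movies/film.avi, while the stack and http prefixes each lose one slash.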
Reply
#13
BaerMan Wrote:Thanks for your help on this script. I really appreciate your support and will test the modified script a.s.a.p.
Thank you for coming up with this idea, it has helped me organize what's in the db.


BaerMan Wrote:The location of the database doesn't change on a running installation and if it does, the user can edit the script again. Regarding the output location it could be useful to temporarily put the files to another place, but I guess it is reasonable to use the standard target and then manually move the files to the desired place.
I agree that it doesn't change. I was testing some scraping and file regex pattern matches. It was easier for me to copy the db to a different location, test a new scraper, and compare the modified db. Since the default db path stays in place, I believe this option can be useful.


BaerMan Wrote:You changed the complete definition of files to be removed by a wildcard-matching: ${RMCMD} -f ${PREFIX}*.1st. This is a quite bad idea as you don't know what files are being matched. I am using several other scripts where the PREFIX is set to the same value and all results would be removed. As a rule of thumb you should never delete files that you didn't create. Declaring them in the script isn't that hard and once it's done, there is no need to think about anymore.
Agreed. I was clearly just being lazy and assuming only this script creates *.1st files.

BaerMan Wrote:You moved the grepping of special paths (http, rtmp) once to the sqlite-cmd and then also grepped http and stack when creating db-only list. My reason for putting this into one single command ("grep -v ://") was to match all special paths (as they all contain this string) and the reason for not grepping directly when querying the database was to be able to see ALL paths in the output file. So one can disable the deletion of the files at the end of the script and see all intermediate output files with all results in.
I see your point, but I forgot why I chose this method. Something prompted me to do it, and if it comes back to me I will post.

BaerMan Wrote:Within the wiki you could make changes by yourself but I'd appreciate if I could keep managing the script itself. As we can see, opinions might differ. Discussion is welcome, of course. In this thread as well as in the wiki but preferably we should keep it in the forum.
Nevertheless you could add another wikipage put your script on it and link to it.

How about adding the MacOS portion and quoting the variables? I think everyone will benefit from those changes. I'll go ahead and integrate only those two changes and post them here.

Thanks
Reply
#14
deathinator Wrote:Thank you for coming up with this idea, it has helped me organize what's in the db.
Nice to hear that it is useful for others too!


Quote:I agree that it doesn't change. I was testing some scrapping and file regex pattern matches. It was easier for me to copy the db to different location , test a new scrapper, and compare the modified db. Considering I leave the default db path, I believe this option can be useful.
You could also just copy the script to another place and heavily modify it to suit your needs. That's what I always do. Thank the Lord there are commands like 'rm' to get rid of this stuff now and then.

Quote:I see your point, but I forgot why I choose this method, something triggered me to to do and if it comes back to me I will post.
Maybe the current version fits your needs. Have you tried it yet?

Quote:How about adding the MacOS portion and quote the variables? I think everyone will benefit from those changes. I'll go ahead and integrate those two changes only and post them here.
You can add a chapter to the wiki page pointing out some (useful) settings. I could integrate it into the script too, but I don't want the comments to become three times as big as the code.
Reply
#15
Hi BaerMan,

As I really loved your idea with this script, I've decided to port it to a web interface and make it cross-platform.

I had to change the code completely (to get to work in Javascript) but the idea is still there.

You can try it here and let me know what you think.

Usage :
- unzip the file in the XBMC web directory
- enable the XBMC web server
- open your browser to http://<your_XBMC_IP>:<port>/files
- wait a bit ...

For now it works with Firefox; I'm having trouble with IE.

If you don't mind, I will also add it to my app XWMM and add you to the credits.

Happy Holidays
Reply
