Problems with cleanstrings
#1
Question 
Hi guys

I would really appreciate some help debugging my cleanstrings entry in advancedsettings.xml.

By my understanding, cleanstrings is used to extract a useful search string from a filename for the purpose of scraping, by removing unwanted characters. Let me know if I've got the wrong idea here.

My problem is that I have a lot of movie series named with the convention seriesname - ## - moviename.xxx and high definition movies with a [HD] flag at the start of the filename.

This really seems to confuse the scraper, and for obvious reasons I wanted to avoid renaming all of the files. So I was relieved when I read about advancedsettings.xml and cleanstrings.

At first I tried using a more general expression, but when that didn't work, I wanted to see if I could confirm it working for a specific string. So I moved my James Bond films into a subfolder and repeatedly added and removed the folder using different cleanstrings expressions, but it hasn't worked once. I have pasted extracts of the important parts of xbmc.log below (with debugging on).


Code:
14:12:20 T:3924 M:2603417600  NOTICE: Loaded advancedsettings.xml from special://profile/advancedsettings.xml
14:12:20 T:3924 M:2603413504  NOTICE: Contents of special://profile/advancedsettings.xml are...
                                            <advancedsettings>
                                              <video>
                                                <cleanstrings>
                                                  <regexp>(007 - .. - )</regexp>
                                                </cleanstrings>
                                              </video>
                                            </advancedsettings>

14:13:31 T:2284 M:2579779584   DEBUG: No NFO file found. Using title search for 'I:\videos\movies2\james bond\007 - 01 - Dr. No [1962].mkv'
14:13:31 T:3284 M:2579775488   DEBUG: thread start, auto delete: 0
14:13:31 T:3284 M:2579775488   DEBUG: Thread 3284 terminating
14:13:31 T:2284 M:2579738624   DEBUG: CIMDB::InternalFindMovie: Searching for '007 - 01 - dr. no' using themoviedb.org scraper (file: 'tmdb.xml', content: 'movies', language: 'en', date: '2009-11-11', framework: '1.1')

14:13:42 T:2284 M:2578227200   DEBUG: No NFO file found. Using title search for 'I:\videos\movies2\james bond\007 - 02 - From Russia With Love [1963].mkv'
14:13:42 T:2284 M:2578227200   DEBUG: CIMDB::InternalFindMovie: Searching for '007 - 02 - from russia with love' using themoviedb.org scraper (file: 'tmdb.xml', content: 'movies', language: 'en', date: '2009-11-11', framework: '1.1')

14:13:42 T:2284 M:2578313216   DEBUG: No NFO file found. Using title search for 'I:\videos\movies2\james bond\007 - 03 - Goldfinger [1964].mkv'
14:13:42 T:2284 M:2578313216   DEBUG: CIMDB::InternalFindMovie: Searching for '007 - 03 - goldfinger' using themoviedb.org scraper (file: 'tmdb.xml', content: 'movies', language: 'en', date: '2009-11-11', framework: '1.1')

So you can see that it correctly loads the advanced settings file. But when the search is done, the part I wanted to remove is still in the string. I used the simplest possible expression, so I can't see what I could have done wrong unless I completely misunderstood something.

Thanks in advance for your help
Reply
#2
I have tried using a lot of different cleanstrings expressions, and have actually managed to get some of them to have an effect. However it seems that instead of removing only the matched string, everything after it is also removed. For example, using the same scenario as my previous post, I tried (- .. -), expecting it to remove only the "- 01 -" parts, but everything after that is also removed, leaving just the string "007".
I had assumed that each <regexp> item would remove only the matched parts from a filename, because the wiki doesn't give a very detailed explanation of what it is supposed to do. Can anyone confirm exactly how cleanstrings is supposed to work?
Reply
#3
Code:
(\[.*\])
this is the simplest of the default cleanstrings expressions. it selects only what we want to remove (in this case anything in brackets).
Reply
#4
spiff,

In my experience, it will not only remove anything in brackets, but also everything after the first match.

For example, if a file was named "apples [oranges] bananas" you would expect it to be changed to "apples bananas" after cleaning, removing only the string in brackets. But in my case it will be changed to just "apples ".

Is this normal or only happening to me?
Reply
#5
checked the code, you are correct, that's how it behaves.
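
For anyone following along, here is a minimal sketch of that behaviour outside XBMC. It is an illustration only: truncateAtMatch is a hypothetical name, and std::regex stands in for XBMC's own regex class, so syntax details may differ. The "position greater than 0" test is taken from the original lines quoted in the patch later in this thread.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical stand-in for the matching step of the cleaning routine.
// If the pattern matches at a position greater than 0, everything from
// the start of the match onward is dropped. A match at position 0 is
// ignored, so the name comes back unchanged.
std::string truncateAtMatch(const std::string& name, const std::string& pattern)
{
  const std::regex re(pattern);
  std::smatch m;
  if (std::regex_search(name, m, re) && m.position(0) > 0)
    return name.substr(0, static_cast<std::size_t>(m.position(0)));
  return name;
}
```

Under these assumptions, truncateAtMatch("apples [oranges] bananas", "(\\[.*\\])") gives "apples ", and "(- .. -)" on "007 - 01 - Dr. No" gives "007 ". The expression "(007 - .. - )" matches at position 0 and so changes nothing, which would explain the result in the first post.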
Reply
#6
agrajagzz9 Wrote:Hi guys
My problem is that I have a lot of movie series named with the convention seriesname - ## - moviename.xxx and high definition movies with a [HD] flag at the start of the filename.

Let's focus on your actual problem: you really need to rename your files but don't want to invest the time required, yes?

Just get one of the many GUI based file renamer programs available for the mac and pc, you'll find your original problem isn't really as hard as you thought.

If it was me I'd just write a Q&D perl program to rename the files, but it doesn't sound like regex is your friend.

Having a consistently named library is a worthwhile exercise.

|-<:)
Reply
#7
xbmchead,
I have no problem using regex renaming tools to rename my files, and in fact I have used them extensively to get my library how it is at the moment. I happen to like how my files are named and didn't want to change that just to use xbmc.

I'm not sure what the intention was when implementing the cleanstrings feature, but it seems like it would be much more useful if it allowed you to remove strings from anywhere within the filename. The way it works at the moment, everything to be removed has to be at the end of the filename, otherwise the useful parts are also removed. Actually, with this implementation most of the default entry is unnecessary, you could just match the first bracket and everything after it would be wiped out as well. So cleanstrings is pretty limited in what it can do.
Reply
#8
Just using the first bracket would match too much.

Patches are welcome to improve it :)
Reply
#9
I don't have any experience with large projects, but I'm looking into it at the moment...
Reply
#10
(2010-06-24, 00:45)jmarshall Wrote: Just using the first bracket would match too much.

Patches are welcome to improve it :)

how about this one?

This would be consistent with the regex-match "cutting" behaviour mentioned in this thread. The current implementation only cuts off the tail, so there is no way to use it to cut off the head.

For example, there is no way to transform I.Planet-of-the-apes into Planet-of-the-apes, i.e. to cut off the "I." head.

With the following patch this becomes possible,

Code:
diff --git a/xbmc/Util.cpp b/xbmc/Util.cpp
index b2c0f0e..0a3a20b 100644
--- a/xbmc/Util.cpp
+++ b/xbmc/Util.cpp
@@ -276,8 +276,25 @@ void CUtil::CleanString(const CStdString& strFileName, CStdString& strTitle, CSt
       continue;
     }
     int j=0;
-    if ((j=reTags.RegFind(strFileName.c_str())) > 0)
-      strTitleAndYear = strTitleAndYear.Mid(0, j);
+    CLog::Log(LOGINFO, "trying to match %i:%s on <%s>", i, regexps[i].c_str(), strTitleAndYear.c_str());
+    if ((j=reTags.RegFind(strTitleAndYear.c_str())) >= 0)
+    {
+      int len = reTags.GetFindLen();
+
+      CLog::Log(LOGINFO, "j:%d len:%d", j, len);
+
+      if (j == 0)
+         strTitleAndYear = strTitleAndYear.Mid(len);
+      else
+      {
+         CStdString left  = strTitleAndYear.Left(j);  
+         CStdString right = strTitleAndYear.Mid(j + len);  
+         CLog::Log(LOGINFO, "left:<%s> right:<%s>", left.c_str(), right.c_str());
+         strTitleAndYear = left + right;
+      }
+
+      CLog::Log(LOGINFO, "match %i:%s -> <%s>", i, regexps[i].c_str(), strTitleAndYear.c_str());
+    }  
   }

   // final cleanup - special characters used instead of spaces:

now with the following cleanstrings addition,
Code:
<video>
      <cleanstrings action="prepend">
      <regexp>[0-9IVX]+[.]</regexp>
      </cleanstrings>
</video>

you get,
Code:
00:15:07 T:140122601133824    INFO: trying to match 0:[0-9IVX]+[.] on <I.Planet-of-the-apes>
00:15:07 T:140122601133824    INFO: j:0 len:2
00:15:07 T:140122601133824    INFO: match 0:[0-9IVX]+[.] -> <Planet-of-the-apes>
00:15:07 T:140122601133824    INFO: trying to match 1:[ _\,\.\(\)\[\]\-](ac3|dts|custom|dc|remastered|divx|divx5|dsr|dsrip|dutch|dvd|dvd5|dvd9|dvdrip|dvdscr|dvdscreener|screener|dvdivx|cam|fragment|fs|hdtv|hdrip|hdtvrip|internal|limited|multisubs|ntsc|ogg|ogm|pal|pdtv|proper|repack|rerip|retail|r3|r5|bd5|se|svcd|swedish|german|read.nfo|nfofix|unrated|extended|ws|telesync|ts|telecine|tc|brrip|bdrip|480p|480i|576p|576i|720p|720i|1080p|1080i|3d|hrhd|hrhdtv|hddvd|bluray|x264|h264|xvid|xvidvd|xxx|www.www|cd[1-9]|\[.*\])([ _\,\.\(\)\[\]\-]|$) on <Planet-of-the-apes>
00:15:07 T:140122601133824    INFO: trying to match 2:(\[.*\]) on <Planet-of-the-apes>
00:15:07 T:140122601133824    INFO: trying to match 3:[ _\,\.\(\)\[\]\-](ac3|dts|custom|dc|remastered|divx|divx5|dsr|dsrip|dutch|dvd|dvd5|dvd9|dvdrip|dvdscr|dvdscreener|screener|dvdivx|cam|fragment|fs|hdtv|hdrip|hdtvrip|internal|limited|multisubs|ntsc|ogg|ogm|pal|pdtv|proper|repack|rerip|retail|r3|r5|bd5|se|svcd|swedish|german|read.nfo|nfofix|unrated|extended|ws|telesync|ts|telecine|tc|brrip|bdrip|480p|480i|576p|576i|720p|720i|1080p|1080i|3d|hrhd|hrhdtv|hddvd|bluray|x264|h264|xvid|xvidvd|xxx|www.www|cd[1-9]|\[.*\])([ _\,\.\(\)\[\]\-]|$) on <Planet-of-the-apes>
00:15:07 T:140122601133824    INFO: trying to match 4:(\[.*\]) on <Planet-of-the-apes>
00:15:07 T:140122601133824   DEBUG: FindMovie: Searching for 'Planet-of-the-apes' using The MovieDB scraper (path: '/home/popp/.xbmc/addons/metadata.themoviedb.org', content: 'movies', version: '3.5.0')
00:15:07 T:140122601133824   DEBUG: scraper: CreateSearchUrl returned <url>http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&amp;query=planet-of-the-apes&amp;year=&amp;language=en</url>
00:15:07 T:140122601133824   DEBUG: CurlFile::Open(0x46d3ca0) http://api.themoviedb.org/3/search/movie?api_key=57983e31fb435df4df77afb854740ea9&query=planet-of-the-apes&year=&language=en
00:15:07 T:140122601133824    INFO: easy_aquire - Created session to http://api.themoviedb.org
00:15:07 T:140122601133824   DEBUG: scraper: GetSearchResults returned <results></results>
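
The splice-out logic of the patch (keep the text left and right of the match, drop only the match) can be sketched on its own like this. It is a rough illustration, not the patch itself: removeFirstMatch is a hypothetical name, and std::regex is used instead of XBMC's regex class.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes only the first match of the pattern and
// keeps everything before and after it. With no match the title is
// returned unchanged.
std::string removeFirstMatch(const std::string& title, const std::string& pattern)
{
  const std::regex re(pattern);
  std::smatch m;
  if (!std::regex_search(title, m, re))
    return title;
  // prefix() is the text left of the match, suffix() the text right of it.
  return m.prefix().str() + m.suffix().str();
}
```

With this sketch, removeFirstMatch("I.Planet-of-the-apes", "[0-9IVX]+[.]") gives "Planet-of-the-apes", and removeFirstMatch("007 - 01 - Dr. No", "- .. - ") gives "007 Dr. No".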

You could probably make the patch shorter by using
char * CRegExp::GetReplaceString ( const char * sReplaceExp )
but I am not familiar with this library, so I have not yet figured out how to specify sReplaceExp (probably something like /matchRe// ), and I have not been able to find a reference guide. I was just looking through the Doxygen-generated docs...






Reply
#11
There is a way to remove the front of a string by using cleandatetime EXCEPT there is a bug in the cleanstrings routine that prevents it from working. (the original match is on strFileName, but the replacement string is from strTitleAndYear - your code corrects this problem)

The way cleanstrings currently works is to clean everything after a match, so I think changing it to just delete the match (as you are doing with left and right) is wrong IMO, but otherwise I changed my personal copy to do just this.
Code:
-    if ((j=reTags.RegFind(strFileName.c_str())) > 0)
-      strTitleAndYear = strTitleAndYear.Mid(0, j);
+    if ((j=reTags.RegFind(strTitleAndYear.c_str())) >= 0 && strTitleAndYear.size() != reTags.GetFindLen())
+      strTitleAndYear = (j!=0) ? strTitleAndYear.Mid(0, j): strTitleAndYear.Mid(reTags.GetFindLen());
I also added a guard against accidentally matching the whole string.
I didn't submit it yet, because I wanted to introduce a new mechanism where cleanstrings would return a list of possibilities, but I haven't gotten around to it yet.
mike
PS I laughed because I looked into this because of Planet of the Apes as well (it was the first of many!)
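
For comparison, the two-line variant above (cut the head when the match is at the start, otherwise cut the tail, and skip a match that covers the whole string) might look like this as a standalone sketch. cleanHeadOrTail is a hypothetical name and std::regex replaces XBMC's regex class, so treat it as an approximation of the idea only.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper mirroring the head-or-tail idea:
//  - match at position 0: drop the matched head, keep the rest
//  - match further in:    keep only the text before the match
//  - match covering the whole title: leave the title alone (guard)
std::string cleanHeadOrTail(const std::string& title, const std::string& pattern)
{
  const std::regex re(pattern);
  std::smatch m;
  if (!std::regex_search(title, m, re))
    return title;
  const auto pos = static_cast<std::size_t>(m.position(0));
  const auto len = static_cast<std::size_t>(m.length(0));
  if (len == title.size())
    return title;
  return pos != 0 ? title.substr(0, pos) : title.substr(len);
}
```

Here cleanHeadOrTail("I.Planet-of-the-apes", "[0-9IVX]+[.]") gives "Planet-of-the-apes", while "\\[.*\\]" on "apples [oranges] bananas" still gives "apples ", so the existing tail-cutting behaviour is preserved.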
Reply
#12
Is anyone still interested in finishing and submitting this? It would be greatly appreciated to have functional searches without updating your file names/file structure.

Thanks

Related
http://forum.xbmc.org/showthread.php?tid...pid1753789
http://trac.xbmc.org/ticket/13977
https://github.com/xbmc/xbmc/pull/1730
Reply
#13
(2010-06-24, 00:37)agrajagzz9 Wrote: I'm not sure what the intention was when implementing the cleanstrings feature, but it seems like it would be much more useful if it allowed you to remove strings from anywhere within the filename. The way it works at the moment, everything to be removed has to be at the end of the filename, otherwise the useful parts are also removed.

Couldn't agree more! cleanstrings should just remove the first match (and not basically append a hidden .* to the regex). Also, the documentation is wrong. "Please note that everything right of the match (at the end of the file name) is removed." Should apparently be "... everything UP TO AND INCLUDING right of the match (at the end of the file name) is removed". Plus '(at the end of the file name)' makes no sense either, as it's simply everything up to and including the match string.
Reply
#14
I name my movies in the same way as @agrajagzz9 and have also resorted to manually matching them.
Any chance of resurrecting this feature request? It would be a perfect solution for me.

EDIT:
I decided to take this on: https://github.com/xbmc/xbmc/pull/19219
I slightly modified the solution proposed by @dragonflight.
Reply
