
**AchimTuran** · (This post was last modified: 2015-01-10, 19:34 by AchimTuran.)

Hello XBMC community,

For a few weeks I have been wondering why XBMC has no convolution engine for audio output.
It would be really great to have an equalizer, digital room correction, etc. in XBMC, so I would like to have functionality like BruteFIR's.
Hence the question: is anyone working on an implementation of a convolution engine for XBMC?

To better understand BruteFIR's partitioned convolution algorithm, I have written a test implementation using MATLAB and CUDA (GPU programming). It is not finished yet and a lot remains to be done, but I'll keep working on it and would like to create a library that provides partitioned convolution with CUDA. I will of course implement the same algorithms on the CPU as well, so that PCs without an NVIDIA graphics card can be used.
Another goal might be to replace CUDA with OpenCL, but at the moment I have no programming experience with OpenCL.
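As a rough illustration of what "partitioned" means here, the following C++ sketch splits the impulse response into equal-sized partitions and processes the input block by block, adding each short partial convolution at the right output offset. It is a minimal sketch, not BruteFIR's code and not the implementation described in this post: the names `directConvolve` and `partitionedConvolve` are made up for the example, and the short convolutions are done in the time domain, whereas a real engine would typically perform them as multiplications in the frequency domain via an FFT.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: direct convolution, output length x.size() + h.size() - 1.
std::vector<float> directConvolve(const std::vector<float>& x,
                                  const std::vector<float>& h)
{
  assert(!x.empty() && !h.empty());
  std::vector<float> y(x.size() + h.size() - 1, 0.0f);
  for (std::size_t n = 0; n < x.size(); ++n)
    for (std::size_t k = 0; k < h.size(); ++k)
      y[n + k] += x[n] * h[k];
  return y;
}

// Uniformly partitioned convolution (illustrative, time domain).
// The filter h is cut into partitions of `block` taps and the input x into
// blocks of `block` samples. Every (input block, filter partition) pair
// yields a short convolution that is accumulated into the output.
std::vector<float> partitionedConvolve(const std::vector<float>& x,
                                       const std::vector<float>& h,
                                       std::size_t block)
{
  assert(block > 0 && !x.empty() && !h.empty());
  std::vector<float> y(x.size() + h.size() - 1, 0.0f);
  for (std::size_t xb = 0; xb < x.size(); xb += block) {
    const std::size_t xEnd = std::min(xb + block, x.size());
    for (std::size_t hp = 0; hp < h.size(); hp += block) {
      const std::size_t hEnd = std::min(hp + block, h.size());
      // Short convolution of one input block with one filter partition.
      for (std::size_t n = xb; n < xEnd; ++n)
        for (std::size_t k = hp; k < hEnd; ++k)
          y[n + k] += x[n] * h[k];
    }
  }
  return y;
}
```

For any block size the result matches the direct convolution (up to floating-point rounding); the benefit of partitioning only appears once each short convolution is replaced by an FFT-based one and the input arrives as a stream.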

You are surely asking yourself why I chose CUDA for partitioned convolution. The answer is quite simple: the PC in my living room is an NVIDIA ION, whose CPU is not really fast but whose GPU has a lot of processing power ;)

I have searched the audio engine code a bit to find a good place for calculating the convolution. I don't understand the whole structure of XBMC's audio engine yet (it is simply too powerful ;-)), but I think a good place could be the method "OutputSamples(...)" in the "ActiveAESink.cpp" file (see code snippet below). Is this a good place for testing?

So now it's your turn! What do you think about my plan?

Code:
switch (m_convertState)
{
....
convolutionEngine(...);
while (frames > 0)
{
   MaxFrames = std::min(frames, m_sinkFormat.m_frames);
   written = m_sink->AddPackets(buffer, MaxFrames, true, true);
...

**fritsch** · (This post was last modified: 2014-01-02, 14:43 by fritsch.)

The ActiveAESink object is a high-priority object whose only job is to deliver packets to the sink as quickly and consistently as possible. Doing big calculations in there would break this design, as the sink might run into underruns.

We were discussing such ideas as an additional stage within ActiveAE.cpp, so that all the needed calculations are done beforehand and the sink can just do its job.

Btw., have a look at SSE code: many things can easily be sped up by a factor of 4 without needing a context switch / data copy to the GPU.
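The "factor 4" comes from an SSE register holding four 32-bit floats, so one instruction handles four samples. A minimal C++ sketch of a multiply-accumulate loop, the basic operation a convolution needs, could look like this. The function name `multiplyAccumulate` and the macro `EXAMPLE_HAVE_SSE` are made up for the example and are not taken from XBMC; the intrinsics are the standard ones from `<xmmintrin.h>`, and a scalar loop is used where SSE is not available.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define EXAMPLE_HAVE_SSE 1
#endif

// out[i] += a[i] * b[i] for i in [0, n).
void multiplyAccumulate(const float* a, const float* b, float* out, std::size_t n)
{
  assert(a && b && out);
  std::size_t i = 0;
#ifdef EXAMPLE_HAVE_SSE
  // Four floats per iteration; unaligned loads keep the sketch simple.
  for (; i + 4 <= n; i += 4) {
    const __m128 va = _mm_loadu_ps(a + i);
    const __m128 vb = _mm_loadu_ps(b + i);
    const __m128 vo = _mm_loadu_ps(out + i);
    _mm_storeu_ps(out + i, _mm_add_ps(vo, _mm_mul_ps(va, vb)));
  }
#endif
  // Scalar tail (and complete fallback when SSE is not available).
  for (; i < n; ++i)
    out[i] += a[i] * b[i];
}
```

The data never leaves main memory, which is the point made above: there is no transfer to and from a GPU to pay for.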

**AchimTuran** · 2014-01-02, 19:13

Hello fritsch,

I thought a convolution algorithm should be calculated in real time, with the result output to the sound card immediately. Am I wrong?

Where are the sample rate conversions carried out? Are these operations done offline, so that only a buffer is filled for ActiveAESink, through which the samples are transferred to the sound card?

I think you're right and we also need an implementation with SSE, because not every computer has an NVIDIA card! For this reason I want to design my software architecture so that the core functions of partitioned convolution (complex multiplication, addition and the corresponding buffer management) can be replaced with other implementations, in principle like the implementations of the various sound APIs (AESinkFactory.cpp).
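One way such a replaceable core could be structured is sketched below in C++. This is only an illustration of the idea: the interface `IConvolverBackend`, the class `NativeBackend` and the function `createBackend` are hypothetical names invented for this example and exist neither in XBMC nor in the library discussed here. Only a plain C++ backend is shown; SSE, CUDA or NEON backends would implement the same interface.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

using Spectrum = std::vector<std::complex<float>>;

// Hypothetical interface for the exchangeable core operation.
class IConvolverBackend
{
public:
  virtual ~IConvolverBackend() = default;
  virtual std::string name() const = 0;
  // acc[i] += a[i] * b[i] (complex multiply-add over one partition).
  virtual void multiplyAdd(const Spectrum& a, const Spectrum& b,
                           Spectrum& acc) const = 0;
};

// Plain C++ implementation; always available.
class NativeBackend : public IConvolverBackend
{
public:
  std::string name() const override { return "native"; }
  void multiplyAdd(const Spectrum& a, const Spectrum& b,
                   Spectrum& acc) const override
  {
    assert(a.size() == b.size() && a.size() == acc.size());
    for (std::size_t i = 0; i < a.size(); ++i)
      acc[i] += a[i] * b[i];
  }
};

// Hypothetical factory: returns the requested backend, or nullptr if it is
// unknown or not built in, so the caller can decide on a fallback.
std::unique_ptr<IConvolverBackend> createBackend(const std::string& requested)
{
  if (requested == "native")
    return std::make_unique<NativeBackend>();
  // "sse", "cuda", "neon", ... would be handled here.
  return nullptr;
}
```

A caller would try the preferred backend first, for example `createBackend("cuda")`, and fall back to `createBackend("native")` when that returns a null pointer.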

With my implementation I had weaker computers in mind, such as the NVIDIA ION (if I have the right overview, many users run XBMC on this architecture), which can only play 1080p video when the GPU is used for processing. For the same reason I can only watch full-HD movies from YouTube with XBMC, because the latest version of the Adobe Flash Player does not support the GPU!

In addition, I was thinking of multi-channel audio: when very long impulse responses are used for convolution, a lot of additions and multiplications have to be calculated! Hence the idea of using CUDA. Right now my implementation transfers a frame of samples to the GPU, and only the result must be transferred back to the CPU. Thus the data overhead is very "low".

Maybe we could discuss the details on IRC.

**fritsch** · 2014-01-02, 19:17

The only platform-specific thing we do is the sink (except for Mac OS X, which is currently being ported). Read through ActiveAE.cpp with pen and paper and see what the state machine does. We use swresample from ffmpeg with different quality levels. The decoding happens losslessly, as does the format conversion, except when the sink does not support 32-bit output and something needs to be adjusted. For volume we use SSE code.

The worst thing that can happen is that audio gets dropped and the user hears it. Therefore doing the convolution in our sink object is a no-go.
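To put a number on the risk: a buffer of N frames at sample rate fs lasts N / fs seconds, and everything that happens between two deliveries to the sound card has to fit into that time. The small C++ helper below just performs this arithmetic; the function name `bufferDurationMs` and the figures in the note after it are example assumptions, not values taken from XBMC.

```cpp
#include <cassert>

// Playback time of one buffer in milliseconds. If delivering the next buffer
// takes longer than this, the sink runs dry and the user hears a dropout.
double bufferDurationMs(unsigned frames, unsigned sampleRate)
{
  assert(sampleRate > 0);
  return 1000.0 * static_cast<double>(frames) / static_cast<double>(sampleRate);
}
```

For example, `bufferDurationMs(480, 48000)` gives 10 ms, so a convolution that occasionally needs longer than that inside the sink thread would cause an underrun.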

Happy hacking.

**Hedda** · (This post was last modified: 2014-01-02, 20:07 by Hedda.)

(2014-01-02, 14:42)fritsch Wrote: We were discussing such ideas as an additional stage within ActiveAE.cpp, so that all the needed calculations are done beforehand and the sink can just do its job.

Maybe you guys could take inspiration from the OpenMAX DL API to make post processing modular, especially the SP building block part of the OpenMAX DL API

http://en.wikipedia.org/wiki/OpenMAX#Development_layer

(2014-01-02, 19:13)wisler Wrote: I think you're right and we also need an implementation with SSE, because not every computer has an NVIDIA card! For this reason I want to design my software architecture so that the core functions of partitioned convolution (complex multiplication, addition and the corresponding buffer management) can be replaced with other implementations, in principle like the implementations of the various sound APIs (AESinkFactory.cpp).

You might also want to look at NEON on ARM, which is the ARM equivalent of the SSE instruction set on x86:

http://www.arm.com/products/processors/t...s/neon.php

Some newer high-end ARM SoCs are now also starting to come with OpenCL-capable GPUs.

**AchimTuran** · 2014-01-13, 09:09

Hi @ll,

fritsch, I agree with you: AESink should just do its job.

That's why I looked at the CActiveAE class more closely and tried to understand what is being done in the state machine. I found the method RunStages(), where I have started to implement the functionality of my class CActiveConvolver. I am now exactly between the functions that resample the input data of an audio stream and the functions for fading, ReplayGain, volume, ...

I think this is a good place for the convolution engine, because I stay in XBMC's internal format and have the same sampling rate throughout. At the moment I can read the samples of the input stream and, in theory, convolve them. Since my filter algorithm/library is not ready yet, the first tests will be performed later.

In brief, the software architecture I have in mind looks like this:
XBMC -> ActiveAE -> CActiveConvolver -> my library with code for SSE, CUDA, NEON. Furthermore, the convolution engine has internal buffers for each channel.
In the end I want to write a wrapper class (CActiveConvolver) for my convolution library, with the library included as a *.lib. I think it would be good to put that library into \project\BuildDependencies\lib, so that I can work on the lib without touching XBMC's wrapper class.
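The per-channel internal buffers mentioned above are what allow a stream to be filtered block by block: each channel has to remember the tail of its previous input so that the filter output continues seamlessly across block boundaries. The C++ sketch below illustrates this with a simple time-domain FIR filter. The class `MultiChannelFir` is a hypothetical stand-in invented for this example, not the CActiveConvolver class or the library described in this post, and it uses the same filter for every channel to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a convolver with one history buffer per channel.
class MultiChannelFir
{
public:
  MultiChannelFir(std::size_t channels, const std::vector<float>& taps)
    : m_taps(taps), m_history(channels)
  {
    assert(channels > 0 && !taps.empty());
    for (auto& h : m_history)
      h.assign(m_taps.size() - 1, 0.0f); // silence before the stream starts
  }

  // Filters one block of one channel; state carries over between calls.
  std::vector<float> process(std::size_t channel, const std::vector<float>& in)
  {
    assert(channel < m_history.size());
    std::vector<float>& hist = m_history[channel];
    const std::size_t T = m_taps.size();

    std::vector<float> ext(hist);                // previous T-1 input samples
    ext.insert(ext.end(), in.begin(), in.end()); // followed by the new block

    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i)
      for (std::size_t k = 0; k < T; ++k)
        out[i] += m_taps[k] * ext[i + T - 1 - k];

    // Keep the last T-1 input samples for the next call.
    hist.assign(ext.end() - static_cast<std::ptrdiff_t>(T - 1), ext.end());
    return out;
  }

private:
  std::vector<float> m_taps;
  std::vector<std::vector<float>> m_history;
};
```

Feeding a channel the blocks {1, 2} and then {3} with taps {1, 2} produces {1, 4} and then {7}, the same samples as filtering {1, 2, 3} in one go, while the other channels keep their own independent state.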

Now I have some questions about XBMC programming:

1. How can I get settings from the GUI down to my class CActiveConvolver? Through an addon, through Settings-->System-->audio hardware, or how should I do this?

2. How can I open a wav file and pass its contents to my class? I want to use the class CActiveBuffer to load the FIR filter coefficients for each channel and then convert them with XBMC's internal functions to the right destination format and sampling frequency.

3. Does XBMC have plotting functions (e.g. for the frequency response of the loaded FIR filter)?

4. Can I use FFTW in my library for the Fourier transform? On the GPU side I want to use cuFFT from NVIDIA.
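Regarding the frequency response in question 3: independently of how XBMC would draw it, the data to plot can be computed directly from the filter coefficients by evaluating H(e^{jw}) = sum_k h[k] e^{-jwk}. The C++ sketch below does this with a plain sum instead of an FFT, which is fine for a few hundred plot points. The function name `magnitudeResponseDb` is made up for this example, and it says nothing about XBMC's own plotting capabilities.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Magnitude response in dB at `points` frequencies, evenly spaced from
// 0 (DC) to pi (Nyquist, half the sampling rate).
std::vector<double> magnitudeResponseDb(const std::vector<double>& taps,
                                        std::size_t points)
{
  assert(!taps.empty() && points >= 2);
  const double pi = std::acos(-1.0);
  std::vector<double> db(points);
  for (std::size_t p = 0; p < points; ++p) {
    const double w = pi * static_cast<double>(p) / static_cast<double>(points - 1);
    std::complex<double> H(0.0, 0.0);
    for (std::size_t k = 0; k < taps.size(); ++k)
      H += taps[k] * std::polar(1.0, -w * static_cast<double>(k));
    db[p] = 20.0 * std::log10(std::abs(H) + 1e-12); // small offset avoids log(0)
  }
  return db;
}
```

For the two-tap average {0.5, 0.5} this gives about 0 dB at DC and a deep notch at Nyquist, as expected for a simple low-pass filter.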

@Hedda:

Quote: You might also want to look at NEON on ARM, which is the ARM equivalent instruction set of SSE on x86

Thanks, I will also look at NEON.

Quote: Maybe you guys could take inspiration from the OpenMAX DL API to make post processing modular, especially the SP building block part of the OpenMAX DL API

I didn't know OpenMAX, but I will look deeper into it. Sounds interesting. :D

**AchimTuran** · 2015-01-02, 17:53

Hi @ll,

Today I want to show some new screenshots of my adsp.xconvolver addon. It's a very early pre-alpha version, but it can already convolve long FIR filters with an input audio signal using the new ADSP addon system. At the moment it's not easy to use, so the next steps will be to improve the GUI and the usability.

Features I have planned:

- loading FIR filters from wav files
- scraping impulse responses directly from http://www.openairlib.net/auralizationdb with a Python addon
- loading HRTFs to listen to surround sound through headphones
- plotting the FIR filter in the time and frequency domains
- easy filter configuration with the FilterManager (FIR filter channel, image, information, ...)
- a benchmark to optimize LibXConvolver so that you get maximum performance on your system
- optimizations for SSE3 and CUDA (for very long FIR filters), plus a native C implementation
- optimizations for SSE2, SSE4, AVX, AVX2, CUDA, OpenCL, ARM_VFP and NEON are planned for future releases
- in a later version, a RoomEQ module will be available to measure IRs

**Hedda** · (This post was last modified: 2015-01-15, 19:35 by Hedda.)

(2014-01-01, 13:49)wisler Wrote: Another goal might be to replace CUDA with OpenCL, but at the moment I have no programming experience with OpenCL.

You are surely asking yourself why I chose CUDA for partitioned convolution. The answer is quite simple: the PC in my living room is an NVIDIA ION, whose CPU is not really fast but whose GPU has a lot of processing power

Maybe this could now be achieved more simply than before with CF4OCL (C Framework for OpenCL), an open-source-licensed, cross-platform library for developing OpenCL projects in C or C++:

http://fakenmc.github.io/cf4ocl/
https://github.com/FakenMC/cf4ocl/

CF4OCL allows the rapid development of OpenCL host programs in C/C++ while making it easier to benchmark OpenCL events, analyze OpenCL environments, etc.

The CF4OCL library could possibly also be reused to accelerate other processes in Kodi using OpenCL.

Summary

The C Framework for OpenCL, cf4ocl, is a cross-platform pure C object-oriented framework for developing and benchmarking OpenCL projects in C. It aims to:

1. Promote the rapid development of OpenCL host programs in C (with support for C++) and avoid the tedious and error-prone boilerplate code usually required.
2. Assist in the benchmarking of OpenCL events, such as kernel execution and data transfers. Profiling comes for free with cf4ocl.
3. Simplify the analysis of the OpenCL environment and of kernel requirements.
4. Allow for all levels of integration with existing OpenCL code: use as much or as little of cf4ocl as required for your project, with full access to the underlying OpenCL objects and functions at all times.

Features

- Object-oriented interface to the OpenCL API
- New/destroy functions, no direct memory alloc/free
- Easy (and extensible) device selection
- Simple event dependency mechanism
- User-friendly error management
- OpenCL version independent
- Integrated profiling
- Tested on Linux, OSX and Windows

**AchimTuran** · 2015-01-25, 18:54

Hi Hedda,

Sorry for the late response, but I'm very busy until February. Thanks for the tip about CF4OCL, I didn't know it before. When I have more time, I will look deeper into it.

The CUDA implementation of non-uniform partitioned convolution is almost finished and I hope to release a version in February.
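For readers unfamiliar with the term: in non-uniform partitioned convolution the impulse response is cut into partitions of growing size, so that the short partitions at the start keep the latency low while the long partitions further back are cheaper to process per sample. The C++ sketch below generates one possible size sequence. It is purely illustrative: the doubling rule, the function name `nonUniformPartitions` and its parameters are assumptions made for this example, and the scheme actually used in the implementation mentioned above is not described in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns partition lengths that together cover at least `filterLength` taps.
// Illustrative scheme: two partitions of each size, then the size doubles,
// up to `maxBlock`.
std::vector<std::size_t> nonUniformPartitions(std::size_t filterLength,
                                              std::size_t firstBlock,
                                              std::size_t maxBlock)
{
  assert(filterLength > 0 && firstBlock > 0 && maxBlock >= firstBlock);
  std::vector<std::size_t> sizes;
  std::size_t covered = 0;
  std::size_t block = firstBlock;
  while (covered < filterLength) {
    for (int rep = 0; rep < 2 && covered < filterLength; ++rep) {
      sizes.push_back(block);
      covered += block;
    }
    if (block * 2 <= maxBlock)
      block *= 2;
  }
  return sizes;
}
```

With a 1000-tap filter, a first block of 64 and a maximum block of 256, this yields the sizes 64, 64, 128, 128, 256, 256, 256.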

**Hedda** · 2015-01-27, 12:26

Cool!

Maybe you should try to discuss CF4OCL with alwinus and FernetMenta too, since it could possibly be reused to accelerate Audio DSP plugins as well?

http://forum.kodi.tv/showthread.php?tid=186857

**ironic_monkey** · (This post was last modified: 2015-01-27, 12:30 by ironic_monkey.)

uhm, what do you think this is? ;P

The images in post 7 should be a cluestick :)