Analyze audio using Fast Fourier Transform

Question: Analyze audio using Fast Fourier Transform

I am trying to create a graphical spectrum analyzer in Python.

I am currently reading 1024 bytes of a 16-bit, dual-channel, 44,100 Hz sample rate audio stream and averaging the amplitudes of the 2 channels together. So now I have an array of 256 signed shorts. I now want to perform an FFT on that array, using a module like numpy, and use the result to create the graphical spectrum analyzer, which to start with will have just 32 bars.

I have read the Wikipedia articles on the Fast Fourier Transform and the Discrete Fourier Transform, but I am still unclear about what the resulting array represents. This is what the array looks like after I perform an FFT on my array using numpy:

   [ -3.37260500e+05 +0.00000000e+00j   7.11787022e+05 +1.70667403e+04j
   4.10040193e+05 +3.28653370e+05j   9.90933073e+04 +1.60555003e+05j
   2.28787050e+05 +3.24141951e+05j   2.09781047e+04 +2.31063376e+05j
  -2.15941453e+05 +1.63773851e+05j  -7.07833051e+04 +1.52467334e+05j
  -1.37440802e+05 +6.28107674e+04j  -7.07536614e+03 +5.55634993e+03j
  -4.31009964e+04 -1.74891657e+05j   1.39384348e+05 +1.95956947e+04j
   1.73613033e+05 +1.16883207e+05j   1.15610357e+05 -2.62619884e+04j
  -2.05469722e+05 +1.71343186e+05j  -1.56779748e+04 +1.51258101e+05j
  -2.08639913e+05 +6.07372799e+04j  -2.90623668e+05 -2.79550838e+05j
  -1.68112214e+05 +4.47877871e+04j  -1.21289916e+03 +1.18397979e+05j
  -1.55779104e+05 +5.06852464e+04j   1.95309737e+05 +1.93876325e+04j
  -2.80400414e+05 +6.90079265e+04j   1.25892113e+04 -1.39293422e+05j
   3.10709174e+04 -1.35248953e+05j   1.31003438e+05 +1.90799303e+05j...

I am wondering what exactly these numbers represent and how I would convert these numbers into a percentage of a height for each of the 32 bars. Also, should I be averaging the 2 channels together?


Answer 0

The array you are showing contains the Fourier transform coefficients of the audio signal. These coefficients can be used to get the frequency content of the audio. The FFT is defined for complex-valued input, so the coefficients you get out will be complex numbers even though your input is all real values. To find how much of each frequency is present, you need to calculate the magnitude of the FFT coefficient for that frequency. This is not just the real component of the coefficient; you need to calculate the square root of the sum of the squares of its real and imaginary components. That is, if your coefficient is a + b*j, then its magnitude is sqrt(a^2 + b^2). (Strictly speaking, the magnitude is an amplitude; the power is its square, a^2 + b^2.)
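As a minimal sketch of the magnitude step, assuming numpy is available: the 1 kHz test tone and the names `samples`, `coeffs`, `magnitudes` and `by_hand` are made up for illustration and are not from the question.

```python
import numpy as np

# Hypothetical input: 256 mono samples of a 16-bit signal (a 1 kHz tone here).
SAMPLE_RATE = 44100
N = 256
t = np.arange(N) / SAMPLE_RATE
samples = (10000 * np.sin(2 * np.pi * 1000 * t)).astype(np.int16)

coeffs = np.fft.fft(samples)      # N complex coefficients, even for real input
magnitudes = np.abs(coeffs)       # sqrt(real**2 + imag**2) for each coefficient

# The same calculation written out by hand, to show what np.abs does here.
by_hand = np.sqrt(coeffs.real ** 2 + coeffs.imag ** 2)
```

`np.abs` on a complex array is the usual shortcut; the hand-written version is only there to match the formula in the text.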

Once you have calculated the magnitude of each FFT coefficient, you need to figure out which audio frequency each FFT coefficient belongs to. An N-point FFT gives you the frequency content of your signal at N equally spaced frequencies, starting at 0. Because your sampling frequency is 44100 samples/sec and the number of points in your FFT is 256, your frequency spacing is 44100 / 256 = 172 Hz (approximately).

The first coefficient in your array will be the 0 frequency coefficient. That is the DC component: the sum of all your samples, which is N times the average level of the signal. The rest of your coefficients count up from 0 in multiples of 172 Hz until you get to coefficient 128 (N/2), which corresponds to 22050 Hz. In an FFT, you can only measure frequencies up to half your sampling rate. Read up on the Nyquist frequency and the Nyquist-Shannon sampling theorem if you are a glutton for punishment and need to know why, but the basic result is that, for real input, your lower frequencies are replicated (mirrored) in the higher frequency buckets. So the frequencies start from 0, increase by 172 Hz for each coefficient up to the N/2 coefficient, then decrease by 172 Hz until the N - 1 coefficient.
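The bin-to-frequency mapping and the mirroring can be checked with a short sketch. It assumes numpy; the random test signal and the variable names are illustrative only.

```python
import numpy as np

SAMPLE_RATE = 44100
N = 256

bin_spacing = SAMPLE_RATE / N                     # about 172.27 Hz per coefficient
freqs = np.fft.fftfreq(N, d=1.0 / SAMPLE_RATE)    # frequency assigned to each coefficient
# numpy labels the upper half of the array as negative frequencies,
# which is another way of saying it mirrors the lower half.

# Illustrative real-valued input in the 16-bit range.
rng = np.random.default_rng(0)
samples = rng.integers(-32768, 32767, size=N).astype(np.float64)
coeffs = np.fft.fft(samples)

# For real input, coefficient N - k is the complex conjugate of coefficient k.
mirrored = np.allclose(coeffs[1:N // 2], np.conj(coeffs[:N // 2:-1]))

# np.fft.rfft skips the redundant half and returns N/2 + 1 coefficients (0 Hz to Nyquist).
half = np.fft.rfft(samples)
half_freqs = np.fft.rfftfreq(N, d=1.0 / SAMPLE_RATE)
```

For a spectrum display of real audio, working with the `rfft` output is usually enough, since the mirrored half carries no extra information.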

That should be enough information to get you started. If you would like a much more approachable introduction to FFTs than the one on Wikipedia, you could try Understanding Digital Signal Processing, 2nd Ed. It was very helpful for me.

So that is what those numbers represent. Converting to a percentage of height could be done by scaling each frequency component magnitude by the sum of all component magnitudes. Although, that would only give you a representation of the relative frequency distribution, and not the actual power for each frequency. You could try scaling by the maximum magnitude possible for a frequency component, but I’m not sure that that would display very well. The quickest way to find a workable scaling factor would be to experiment on loud and soft audio signals to find the right setting.
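Both scaling ideas can be sketched as follows, assuming numpy. The function names are hypothetical, and the full-scale reference (a 16-bit sine wave landing exactly on one bin, giving a magnitude of N * 32768 / 2) is an assumption chosen for illustration; as the answer says, a workable factor is best found by experiment.

```python
import numpy as np

def share_of_total(magnitudes):
    """Percentage of the summed magnitude held by each component (relative only)."""
    total = magnitudes.sum()
    if total == 0:
        return np.zeros(len(magnitudes))
    return 100.0 * magnitudes / total

def against_full_scale(magnitudes, n_samples, full_scale=32768.0):
    """Percentage of an assumed maximum, so quiet audio gives short bars.

    Assumed reference: a sine of amplitude full_scale that falls exactly on
    one bin produces a magnitude of n_samples * full_scale / 2.
    """
    reference = n_samples * full_scale / 2.0
    return np.clip(100.0 * magnitudes / reference, 0.0, 100.0)
```

The first function always fills the display regardless of volume; the second tracks loudness but will often leave most bars very short, which matches the answer's doubt about how well it displays.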

Finally, you should be averaging the two channels together if you want to show the frequency content of the entire audio signal as a whole. You are mixing the stereo audio into mono audio and showing the combined frequencies. If you want two separate displays for right and left frequencies, then you will need to perform the Fourier Transform on each channel separately.
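A minimal sketch of the downmix, assuming numpy and assuming the stream is interleaved little-endian signed 16-bit (left, right, left, right, ...). The function name `downmix_to_mono` is hypothetical.

```python
import numpy as np

def downmix_to_mono(raw_bytes):
    """Split interleaved 16-bit stereo bytes into left, right and averaged mono."""
    # '<i2' means little-endian signed 16-bit; this sample format is an assumption.
    frames = np.frombuffer(raw_bytes, dtype="<i2").reshape(-1, 2)
    left = frames[:, 0].astype(np.float64)
    right = frames[:, 1].astype(np.float64)
    mono = (left + right) / 2.0   # averaging in float avoids int16 overflow
    return left, right, mono
```

With 1024 bytes this yields 256 samples per channel, matching the question. For separate left and right displays, run the FFT on `left` and `right` instead of `mono`.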


Answer 1

Although this thread is years old, I found it very helpful. I just wanted to give my input to anyone who finds this and is trying to create something similar.

As for the division into bars, this should not be done as antti suggests, by dividing the data equally based on the number of bars. The most useful approach is to divide the data into octave parts, each octave being double the frequency of the previous one (i.e. 100 Hz is one octave above 50 Hz, which is one octave above 25 Hz).

Depending on how many bars you want, you divide the whole range into 1/X octave ranges. Based on a given center frequency of A on the bar, you get the upper and lower limits of the bar from:

upper limit = A * 2 ^ ( 1 / ( 2 * X ) )
lower limit = A / 2 ^ ( 1 / ( 2 * X ) )

To calculate the next adjoining center frequency you use a similar calculation:

next lower =  A / 2 ^ ( 1 / X )
next higher = A * 2 ^ ( 1 / X )
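The center-frequency walk can be sketched like this. The function name `octave_centers` and the 40 Hz to 20 kHz limits are assumptions for illustration; only the step formula comes from the text.

```python
def octave_centers(start=1000.0, x=3, f_min=40.0, f_max=20000.0):
    """List 1/x-octave center frequencies between f_min and f_max, passing through start."""
    step = 2.0 ** (1.0 / x)          # ratio between adjoining centers
    centers = [start]
    # next lower = A / 2^(1/X), repeated until the lower limit is reached
    while centers[0] / step >= f_min:
        centers.insert(0, centers[0] / step)
    # next higher = A * 2^(1/X), repeated until the upper limit is reached
    while centers[-1] * step <= f_max:
        centers.append(centers[-1] * step)
    return centers

centers = octave_centers()
```

Every x-th entry in the list is one octave (a factor of two) above the entry x positions earlier.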

You then average the data that fits into these ranges to get the amplitude for each bar.

For example: we want to divide into 1/3-octave ranges and we start with a center frequency of 1 kHz.

Upper limit = 1000 * 2 ^ ( 1 / ( 2 * 3 ) ) = 1122.5
Lower limit = 1000 / 2 ^ ( 1 / ( 2 * 3 ) ) =  890.9

Given 44100 Hz and 1024 samples (43 Hz between each data point), we should average values 21 through 26 ( 890.9 / 43 = 20.72 ~ 21 and 1122.5 / 43 = 26.10 ~ 26 ).

(1/3-octave bars would get you around 30 bars between ~40 Hz and ~20 kHz.) As you can figure out by now, as we go higher we average a larger range of numbers. Low bars typically include only 1 or a small number of data points, while the higher bars can be the average of hundreds of points. The reason is that 86 Hz is an octave above 43 Hz, while 10086 Hz sounds almost the same as 10043 Hz.
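The worked 1 kHz example can be reproduced with a short sketch, assuming numpy. The function names are hypothetical, and rounding the limits to the nearest bin follows the arithmetic in the example.

```python
import numpy as np

def octave_band_limits(center, x):
    """Lower and upper limits of a 1/x-octave band around a center frequency."""
    half_step = 2.0 ** (1.0 / (2.0 * x))
    return center / half_step, center * half_step

def band_bins(center, x, sample_rate, n_samples):
    """First and last FFT bin inside the band, rounded to the nearest bin."""
    spacing = sample_rate / n_samples          # Hz between data points
    low, high = octave_band_limits(center, x)
    return int(round(low / spacing)), int(round(high / spacing))

def bar_value(magnitudes, first, last):
    """Average the magnitudes of bins first..last (inclusive) into one bar."""
    return float(np.mean(magnitudes[first:last + 1]))

low, high = octave_band_limits(1000.0, 3)
first, last = band_bins(1000.0, 3, 44100, 1024)
```

Here `magnitudes` is expected to be the array of FFT magnitudes for one block of samples, so that `bar_value(magnitudes, first, last)` gives the height of the 1 kHz bar before any scaling.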


Answer 2

What you have is a sample whose length in time is 256/44100 = 0.00580499 seconds. This means that your frequency resolution is 1 / 0.00580499 = 172 Hz. The 256 values you get out from Python correspond to frequencies spaced 172 Hz apart, starting at 0 Hz: value k sits at roughly k*172 Hz for k up to 128 (22050 Hz, half the sampling rate), and for real input the remaining values mirror the first half. The numbers you get out are complex numbers (hence the "j" at the end of every second number).

EDITED: FIXED WRONG INFORMATION

You need to convert the complex numbers into amplitudes by calculating sqrt(i^2 + j^2), where i and j are the real and imaginary parts, respectively.

If you want to have 32 bars, you should, as far as I understand, take the average of four successive amplitudes. Since only the first half of the 256 values is unique for real input, that gives 128 / 4 = 32 bars, as you want.
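A minimal sketch of this equal-width grouping, assuming numpy. The function name `bars_from_samples` is hypothetical, and dropping the 0 Hz term so that exactly 128 values remain is a choice made here, not something the answer specifies.

```python
import numpy as np

def bars_from_samples(samples, n_bars=32):
    """Average equal-sized groups of FFT magnitudes into n_bars values."""
    magnitudes = np.abs(np.fft.rfft(samples))   # N/2 + 1 values for real input
    usable = magnitudes[1:]                     # drop the 0 Hz term, leaving N/2 values
    per_bar = len(usable) // n_bars             # 128 // 32 = 4 for a 256-sample block
    return usable[:per_bar * n_bars].reshape(n_bars, per_bar).mean(axis=1)
```

For a 256-sample block this averages bins 1-4 into the first bar, 5-8 into the second, and so on.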


Answer 3

The FFT returns N complex values, and for each of them you can compute the modulus = sqrt(real_part^2 + imaginary_part^2). To get the value for each band, you sum the moduli of all the harmonics inside the band. Below is an example for a 10-bar spectrum analyzer. The C code has to be wrapped to get a pyd Python module.

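float *samples_vett;         /* input samples */
float *out_filters_vett;     /* output: one summed magnitude per band */
int Nsamples;                /* number of input samples (FFT length) */
float band_power = 0.0f;
float harmonic_amplitude = 0.0f;
int i, out_index;

/* ttfr_out_vett is the FFT output: an array of complex values with
   fields .r (real part) and .i (imaginary part). Its declaration and
   the FFT call itself are not shown in this fragment. */

out_index = 0;

for (i = 0; i < Nsamples / 2 + 1; i++)
{
    /* At each band edge, store the finished band and start a new one. */
    if (i == 1 || i == 2 || i == 4 || i == 8 || i == 17 || i == 33 || i == 66 || i == 132 || i == 264 || i == 511)
    {
        out_filters_vett[out_index] = band_power;
        band_power = 0;
        out_index++;
    }

    /* Magnitude of bin i, accumulated into the current band. */
    harmonic_amplitude = sqrt(pow(ttfr_out_vett[i].r, 2) + pow(ttfr_out_vett[i].i, 2));
    band_power += harmonic_amplitude;
}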
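For readers who do have numpy available, the same band summing can be sketched in Python. This is an illustrative equivalent rather than the answerer's module: the function name `band_sums` is hypothetical, and a 1024-sample block is assumed so that the band edges fit.

```python
import numpy as np

# Band edges copied from the C loop: a band is closed whenever i hits one of these.
BAND_EDGES = [1, 2, 4, 8, 17, 33, 66, 132, 264, 511]

def band_sums(samples):
    """Sum FFT magnitudes into the 10 bands [0,1), [1,2), [2,4), ... [264,511)."""
    magnitudes = np.abs(np.fft.rfft(samples))   # bins 0 .. N/2
    starts = [0] + BAND_EDGES[:-1]              # first bin of each band
    # reduceat sums from each start up to the next start; the slice ends the last band at 511.
    return np.add.reduceat(magnitudes[:BAND_EDGES[-1]], starts)
```

As in the C fragment, the first output is the 0 Hz term on its own and the bins from 511 upward are never written to a band.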

I designed and built a complete 10-LED-bar spectrum analyzer in Python. Instead of using the numpy library (too big to be worthwhile just for the FFT), I created a Python pyd module (just 27 KB) that computes the FFT and splits the entire audio spectrum into bands.

In addition, to read the output audio, I created a loopback WASAPI PortAudio pyd module. You can see the project (block diagram) in the image 10BarsSpectrumAnalyzerWithWASapi.jpg

Just added a tutorial video on my YouTube channel: how to design and make a very smart Python Spectrum Analyzer 10 Led Bar