MATHEMATICAL MODEL OF THE SYSTEM OF ACTIVE PROTECTION AGAINST EAVESDROPPING OF SPEECH INFORMATION ON THE SCRAMBLER GENERATOR

Reliable systems that protect speech information from interception by cybercriminals are a fundamental concern of the security services of organizations and firms. For this purpose, active jamming systems are used at the border of the controlled area. The main element of such systems is the noise generator. However, in many cases “white” noise and its clones are used as interference, which leaves an attacker an opportunity to gain unauthorized access. The structure and mathematical model of a speech information protection system based on a scrambler-type noise generator are proposed. Moving speech information protection systems to this structure makes it possible to abandon energy masking of speech, which is outdated and ineffective in modern conditions, in favor of a more productive method: information (linguistic) masking. An analysis of the destructive effect of this type of interference shows its high resistance to modern methods of mathematical processing of digital phonograms (wavelet transform, correlation-spectral analysis, etc.), interference filtering, and separation of speakers' voices. Studies of the mathematical model in the Matlab 15 R2015a/Simulink environment show the high efficiency of such a protection system and a decrease in the signal-to-noise ratio by 6...9 dBA at a residual speech intelligibility of 0.1. This leads to a decrease in

© The Author(s) 2020. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0). Received date 27.02.2020. Accepted date 15.04.2020. Published date 11.05.2020. Original Research Article: full paper (2020), «EUREKA: Physics and Engineering» Number 3


Literature review and problem statement
In many cases, an interfering signal generator (ISG) generates "white" noise or one of its clones ("pink", "brown" or "speech-like") as the interference signal. The main advantages of such ISGs are the maximum spectral density and its uniformity over a given section of the spectrum, which led to their wide adoption [4].
However, the development of digital methods for processing acoustic signals (phonograms), such as the wavelet transform [5,6], correlation-spectral analysis [6], neural networks [7][8][9], etc., has produced technologies that successfully filter out such interference. They allow up to 40...60 % of the linguistic component of a speech message to be recognized even for very noisy signals, with signal-to-noise ratios (SNR) as low as -18...-10 dBA [6,10].
An even greater increase in the level of the interfering signal significantly worsens the bioacoustic parameters of the room and negatively affects the health of personnel [11].
An alternative direction in the development of speech information protection systems (SIPS) is methods and technologies that use the speaker's speech directly and/or manipulate it. In the last decade, SIPS using "speech chorus" methods ("crowd rumble", "cocktail party", etc.) and time-frequency transformations (frequency and/or temporal reverberation) have achieved great commercial success. These methods provide a significantly higher level of protection of speech information. However, they also do not provide reliable protection: the correlation-spectral method, a neural network with a binary mask, etc., allow the signals to be successfully reengineered ("breaking up" the crowd into separate speakers and removing reverberation).
Since SIPS functionality depends on the type of noise used and the principle of interference generation, the system's capabilities and quality of operation are assessed by how readily the speech of a specific speaker can be separated from the overall signal.
The results of Google's research on recognizing speech, whose source can be either a speaker or a loudspeaker, against the background of other people's conversations and acoustic noise created by the room sound system are presented in [7]. A technology is proposed that isolates the voice of the target speaker from other voices using a reference signal derived from that speaker's voice. The system uses two separate neural networks. This approach reduces the recognition error (word error rate, WER) by more than a factor of two, from 55.9 % to 23.4 %.
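For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The sketch below is a minimal illustration of that definition and is not taken from [7]; the function name `word_error_rate` and the distorted transcript are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical recognition result: one substitution and one deletion.
wer = word_error_rate(
    "brought the ship onto the landing path",
    "brought a ship onto landing path",
)

# Ratio of the two WER values quoted from [7].
improvement = 55.9 / 23.4
```

Here `improvement` evaluates to roughly 2.4, which is consistent with the "more than a factor of two" statement.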
However, this method has significant drawbacks: it is difficult to implement, especially when speakers use short phrases and long pauses, when speakers change dynamically, etc. Another important condition for the method to work is that "clean" speech from the speaker is needed at the initial stage of neural network training.
Also, it is not clear from the article at what signal-to-noise ratios the results were obtained. A technology for recognizing the speaker's speech against the background of acoustic noise, based on energy control and the entropy of the signal spectrum, is presented in [8]. The paper compares the ability to detect signs of speech information against different types of interference (noise) and at different signal-to-noise ratios (-10 dBA<SNR<10 dBA). It is shown that under such conditions, examining the entropy of the signal spectrum is more effective for detecting signs of the speaker's speech than energy control. The studies used speech signals from the TIDigits database and noise from the Noisex database.
As disadvantages of the work, it should be noted that the Noisex noise database includes only typical (including rare and special) noises and does not provide for the synthesis of speech-like signals based on the speech of specific speakers. Moreover, the mathematical dependencies used in [8] are not effective for isolating the speech features of a specific speaker against the background of a speech-like signal.
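The two features compared in [8] can be illustrated on a single frame. The sketch below is a generic textbook formulation, not the implementation from [8]: spectral entropy is taken as the Shannon entropy of the normalized power spectrum (scaled to 0...1), and energy as the mean squared amplitude. The frame length, the test tone and the noise are illustrative assumptions.

```python
import cmath
import math
import random


def power_spectrum(frame):
    """One-sided power spectrum via a direct DFT (adequate for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def spectral_entropy(frame):
    """Shannon entropy of the normalized power spectrum, scaled to 0..1."""
    power = power_spectrum(frame)
    total = sum(power)
    if total == 0.0:
        return 0.0
    probs = [p / total for p in power]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return entropy / math.log2(len(power))


def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


n = 64
# A pure tone stands in for a strongly structured (voiced) frame.
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
# Seeded Gaussian noise stands in for an unstructured interference frame.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

tone_entropy = spectral_entropy(tone)
noise_entropy = spectral_entropy(noise)
```

The structured frame concentrates its power in one bin and has an entropy close to zero, while the noise frame spreads power over all bins; the energy feature alone cannot make this distinction when both frames have similar power.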
The identification of signs of speech information against noise based on the ZCR (Zero Crossing Rate) and STE (Short Time Energy) methods was considered in [9]. An analysis of the results shows that both methods provide high accuracy in detecting a speech signal only for SNR>0 dBA, and for SNR≤0 dBA, both methods become ineffective.
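Both features named here have simple standard definitions, sketched below for one frame. This is a generic illustration rather than the code used in [9]; the sampling rate and the 100 Hz test tone are assumptions chosen for the example.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


fs = 8000  # assumed sampling rate, Hz
# 0.1 s of a 100 Hz tone with a small phase offset (10 full periods).
tone = [math.sin(2 * math.pi * 100 * t / fs + 0.3) for t in range(800)]

zcr = zero_crossing_rate(tone)
ste = short_time_energy(tone)
```

A detector built on these features typically compares them with thresholds estimated from noise-only frames, which is one plausible reason why both lose discriminating power once the noise dominates (SNR≤0 dBA).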
In [12], the influence of various types of interference, including the speech of another speaker (the masker), on the intelligibility of the speech of the main speaker was studied. It is shown that information (linguistic) noise is preferable to energy noise ("white" noise and its clones). In this case, the quality of the masking depends strongly on the similarity between the speech of the target speaker and the speech of the masker (the interference signal). The best masking effect was obtained when the speakers were of different sexes, and the worst when the same person was used as both the target speaker and the masker.
However, as shown in that work, this is typical for cases with SNR≥0 dBA. For SNR<0 dBA, the authors failed to obtain stable results. Also, cases with significant noise levels, i.e., with SNR<-10 dBA, were not considered.
The influence of significant noise levels and stressful conditions on auditors and their ability to recognize a speaker's speech is studied in [11]. The whole range of outcomes is considered: from the auditor's confident recognition and authentication of the speaker's speech, through recognition of only the linguistic component of the message, to the mere detection of traces of speech against the background of noise. Two groups of auditors took part in the studies: native English speakers (L1) and people fluent in English (L2). The studies were conducted in units of the armed forces of Canada. All participants, when necessary, used individual hearing protection with integrated radio communication devices. The results show that the L1 group performed the task much better at all stages of the research. A disadvantage of this work is the uniform suppression of all sounds by the hearing protection system, which at high noise levels significantly affects the ability to recognize the speaker's speech. In addition, [11] considers only noise created by technical means (inside airplanes, armored vehicles, and sea vessels); that is, exclusively energy noise was investigated. It was shown in [12] that such noise is less effective than linguistic noise.
A system that preserves sound arriving from a source in a given direction relative to the auditor while attenuating sound from other directions was proposed in [13]. Selective suppression is achieved by creating the appropriate reference signals from the output of an array of microphones worn on the user's head. The system has high selectivity, but it requires the signal sources to be separated in azimuth. A disadvantage of the system is its sensitivity to the overall level of diffuse noise: for the selected signal, the signal-to-noise ratio must be SNR>0 dBA. At the same time, the signals from other sources can be close to or even higher than the selected signal in level; the system works effectively in open spaces.
The technology for constructing systems capable of working both in open and in enclosed spaces was proposed in [14]. It is based on nonlinear processing (coding) of an audio signal (DirAC, Directional Audio Coding) and the use of an optimal post-filter to control the created field. Using the DirAC method and the optimal post-filter, the authors developed a technology for forming a virtual spatial acoustic environment, consisting of one or more main channels and a background space, in accordance with specified parameters. The proposed technology reduces distortion in parametric spatial sound reproduction while significantly reducing computing costs. The simulation results obtained by the authors show that the proposed technology outperforms state-of-the-art systems for forming a virtual spatial field in complex scenarios with several sources.
The use of this technology for the formation of a virtual spatial acoustic environment is of interest for the tasks of constructing security systems for objects to be protected in order to create a camouflage background at the border of the controlled area and beyond.
Issues not resolved in [14] and requiring further research include taking into account the parameters of external sources of pulsed signals and an external background.
In [15], the recognition of acoustic signals (speaker's speech) in the presence of a competing signal source in the immediate vicinity of the auditor was considered from the point of view of bioacoustics. The results show that informational (linguistic) interference affects the auditor's recognition and understanding of the speaker's speech to a much greater extent than energy interference. At the same time, the participation of same-sex speakers (main and competing) in the experiment is equivalent to reducing the level of interference by 4...5 dBA.
As a remark to [15], it is necessary to note the need to supplement the studies with instrumental measurements and the use of modern methods of noise filtering, similar to those that occur in the auditory tract and the human brain.
Summarizing, modern methods of processing acoustic information make it possible to identify and recognize the speaker's speech at low and medium noise levels (SNR>-10 dBA). Digital processing of phonograms makes it possible to isolate the linguistic component even for very noisy signals (SNR≥-18...-10 dBA) when energy masking methods (noise) are used. Speech-like noise synthesized from the speech of the speaker and several other competing speakers (the "speech choir") proved more resistant to filtering: in the worst case, SNR>-4...-5 dBA.
Thus, the main types of interference that can be used in SIPS, the methods of their formation and, most importantly, the methods for filtering this interference and restoring the speaker's text have been analyzed. Based on the results of the analysis, it can be concluded that the most effective SIPS are those whose ISG transforms and/or manipulates the speaker's speech, as well as those using the "speech choir" methods.
These methods were developed further in an improved method for generating an interfering signal, implemented in a scrambler-type interfering signal generator [16]. The advantage of the method is the increased resistance of the interference signal to the most common methods of filtering interference and restoring the speaker's protected speech. This effect is achieved through the combined use of modern methods of generating an interference signal: methods for manipulating the speaker's speech and the "speech choir" method. Manipulating the speaker's speech provides protection from the technologies (methods) for isolating a speaker's speech in recordings where several people are talking at the same time, described in [7,12,15]. In turn, the "speech choir" method significantly complicates the detection of signs of the speaker's speech (according to the method of [9]), as well as the use of the filtering methods described in [8,14].
Thus, the aim of the research is to develop the structure and mathematical model of a speech information protection system based on a scrambler-type signal generator.
To achieve the aim, the following tasks are set:
1. To develop a generalized SIPS scheme based on the scrambler-type generator of a speech-like interference signal.
2. To develop and study a simplified SIPS mathematical model based on the scrambler-type generator of a speech-like interference signal.

1. Generalized scheme of a system for protecting speech information
The main idea in the SIPS development is the use of a specialized scrambler-type speech-like noise generator.
In the general case, the SIPS is envisaged to consist of (Fig. 1):
- a scrambler-type speech-like noise generator (STSNG);
- a control block of hazardous signal levels (CBHSL);
- the sound system of the room;
- a set of acoustic and vibration emitters;
- a set of acoustic and vibration meters for hazardous signals.
The STSNG uses a combined method of generating interference: frequency and time scrambling is applied to the signals received from the room sound system. In general, the number of signal sources (microphones) connected to the generator is determined by the parameters of the input device and does not affect the principle of operation of the generator. A set of acoustic and vibration emitters is connected to the output of the generator.
The generator implements the function of "speech choir". To this end, the generator provides for the use of a multi-channel receiver and external flash memory.

Computer Sciences
The structure and operation of the generator are considered in [16]. The generator consists of:
- an input device (normalizes the input signal and performs analog-to-digital conversion according to the fast Fourier transform algorithm);
- a frequency scrambler (performs band-shift permutations of the signal spectrum, controlled by dynamically changing passwords);
- a digital-to-analog converter (uses the inverse Fourier transform algorithm);
- a time scrambler (performs band signal permutations, controlled by dynamically changing passwords);
- a unit for generating dynamically changing passwords.
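The two scrambling stages listed above can be sketched in a few lines. This is only an illustration under stated assumptions and not the generator from [16]: the frame length, the number of bands, the use of a seeded `random.Random` as a stand-in for the password unit, and the reading of time scrambling as a permutation of equal-length segments are all hypothetical choices.

```python
import cmath
import math
import random


def dft(x):
    """Direct DFT; adequate for the short frames used in this sketch."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(x)) for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(v * cmath.exp(2j * math.pi * k * t / n)
                for k, v in enumerate(spectrum)) / n for t in range(n)]


def permutation(count, key):
    """Key-dependent permutation (stand-in for the password unit)."""
    order = list(range(count))
    random.Random(key).shuffle(order)
    return order


def frequency_scramble(frame, bands, key):
    """Permute equal-width bands of the positive-frequency spectrum."""
    n = len(frame)
    spectrum = dft(frame)
    width = (n // 2 - 1) // bands  # bins 1 .. n/2-1, DC and Nyquist untouched
    out = list(spectrum)
    for dst, src in enumerate(permutation(bands, key)):
        for i in range(width):
            k = 1 + dst * width + i
            out[k] = spectrum[1 + src * width + i]
            out[n - k] = out[k].conjugate()  # keep the output real-valued
    return [v.real for v in idft(out)]


def time_scramble(signal, segment_len, key):
    """Permute equal-length time segments of the signal."""
    segments = [signal[i:i + segment_len]
                for i in range(0, len(signal), segment_len)]
    return [v for idx in permutation(len(segments), key) for v in segments[idx]]


n = 64
frame = [math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
         for t in range(n)]
scrambled = frequency_scramble(frame, bands=4, key=1234)
shuffled = time_scramble(list(range(32)), segment_len=8, key=7)
```

Because both stages only permute spectral bins or time segments, the energy of the frame is preserved while its spectral and temporal structure is rearranged; changing the key changes the permutation.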

Fig. 1. Generalized block diagram of a system for protecting speech information from leakage by acoustic and vibration channels based on a scrambler-type speech-like noise generator
A feature of the generator is the use of positive feedback, which loops the interference signal. This significantly improves the quality of masking of the speaker's speech and the resistance of the interference signal to filtering by methods of isolating the speaker's speech, removing reverberation interference, spectral analysis, etc. To prevent self-excitation, a level correction block is added to the feedback path, which keeps the variation of the interference signal within a given range. This circuit provides the necessary level of the interference output signal even in the absence of an input signal.
The operation of the system is monitored by the control block of hazardous signal levels (CBHSL). For this purpose, a set of acoustic and vibration meters of hazardous signals is connected to the CBHSL. Based on the measurement results, the control system generates a corrective action for the generator; if the deviation cannot be eliminated, the system informs the user.
In general, the signal in an acoustic information distribution channel (at the border of the controlled area) can be described by the expression: [...] which takes into account natural noise created by personnel and technical means, including from emitters in other channels (directions).
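The role of the level correction block can be illustrated with a block-based toy loop. This is a sketch under assumptions, not the circuit from [16]: the block length, the target RMS level, the segment shuffle used as a stand-in for the scrambler, and all function names are hypothetical.

```python
import math
import random


def rms(block):
    return math.sqrt(sum(v * v for v in block) / len(block))


def level_correct(block, target_rms):
    """Scale the fed-back block to a fixed RMS level (prevents self-excitation)."""
    level = rms(block)
    if level == 0.0:
        return list(block)
    return [v * target_rms / level for v in block]


def shuffle_segments(block, segment_len, rng):
    """Stand-in for the scrambler: permute equal-length segments."""
    segments = [block[i:i + segment_len]
                for i in range(0, len(block), segment_len)]
    rng.shuffle(segments)
    return [v for seg in segments for v in seg]


def generate(blocks, target_rms=0.5, segment_len=8, seed=1):
    rng = random.Random(seed)
    feedback = [0.0] * len(blocks[0])
    outputs = []
    for block in blocks:
        mixed = [x + f for x, f in zip(block, feedback)]  # positive feedback
        out = shuffle_segments(mixed, segment_len, rng)
        outputs.append(out)
        feedback = level_correct(out, target_rms)
    return outputs


# One block of "speech" (a tone) followed by five silent blocks.
speech = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
silence = [0.0] * 64
outputs = generate([speech] + [silence] * 5)
levels = [rms(block) for block in outputs]
```

In this toy loop the fed-back signal is renormalized on every pass, so the output neither grows without bound nor dies out: once the loop has been excited, the silent blocks still produce interference at the target level.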

2. Mathematical model of a speech information protection system based on a scrambler-type speech-like noise generator
The simulation of the speech information protection system was performed in the Matlab 15 R2015a/Simulink environment. In the simulation, a simplified block diagram of the acoustic/vibration information distribution channel was used: the case with one speaker was considered, the signal attenuation coefficient was taken equal to 1, and background noise was not taken into account. Also, by analogy with [16], external sources of speech signals (multichannel receiver and flash memory) are not used in the speech-like noise generator.
This mode is critical from the point of view of the security system and provides the minimum level of security that is possible when using this generator. However, it makes it possible to explore the limiting capabilities of the security system at minimum resistance to reengineering, that is, to the extraction of speech information from the common stream in an acoustic/vibration channel.
It is also assumed that the attacker has gained direct access to the external boundary of the protected room, using an optical stethoscope ("laser" microphone) and/or mechanical-electronic stethoscopes installed directly on the building structure. Such a location is the most dangerous, since natural and artificial disturbances arising in the propagation medium of the hazardous signal are excluded.
Thus, the proposed approach to modeling the STSNG operation avoids the need to choose a mathematical model for the propagation of an acoustic signal in the controlled area and beyond, and therefore avoids possible errors. The adopted conditions also simulate the combination of system parameters and external conditions that is worst for the security system and most favorable for the attacker. Any complications will only increase the level of security. Fig. 2 shows a mathematical model of an active protection system against unauthorized access to speech information by remote/contact methods using a scrambler-type noise generator, developed in the Matlab 15 R2015a/Simulink environment.
The basis of the model is a scrambler-type speech-like noise generator to which a signal conditioning unit in the channel, consisting of two amplifiers and an adder, is added.
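Under the simplifications adopted for the simulation (one speaker, attenuation coefficient equal to 1, no background noise), a signal conditioning unit of two amplifiers and an adder can be written compactly as a weighted sum. The expression below is only a sketch of that structure: the gain symbols $k_A$ and $k_{SG}$ are introduced here for illustration and are not notation from the source, while A(t), SG(t) and SA(t) denote the input signal, the generator output and the resulting signal in the channel.

```latex
SA(t) = k_A \, A(t) + k_{SG} \, SG(t)
```

With this reading, the ratio of the two gains sets the signal-to-noise ratio in the channel.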
Model operation is monitored by oscilloscopes (signal shape and amplitudes) and spectrum analyzers, which are standard elements in the Matlab 15 R2015a/Simulink environment:
- In-A (t) oscilloscope and Fr_In-A (t) spectrum analyzer: input signal;
- Out-SG (t) oscilloscope and Fr_Out-SG (t) spectrum analyzer: output signal from the noise generator;
- Sum-SA (t) oscilloscope and Fr_Sum-SA (t) spectrum analyzer: resulting signal (signal on the vibrating surface of the structural element).
Also, the model uses a number of other monitors to control its operation at different stages of the formation of an interference signal.
Amplifiers set the signal to noise ratio. As can be seen from Fig. 2 below, the research results were obtained at a ratio of 1:5, which corresponds to -9...-6 dBA.
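The conversion from a ratio to decibels can be checked with a short calculation. The source does not state whether 1:5 is a ratio of powers or of amplitudes, so the sketch below evaluates both readings; it also ignores the A-weighting implied by the dBA unit. The function names are hypothetical.

```python
import math


def power_ratio_db(ratio):
    """Decibel value of a power ratio."""
    return 10.0 * math.log10(ratio)


def amplitude_ratio_db(ratio):
    """Decibel value of an amplitude (gain) ratio."""
    return 20.0 * math.log10(ratio)


ratio = 1 / 5  # signal : noise, as quoted in the text

snr_if_power = power_ratio_db(ratio)          # about -7.0 dB
snr_if_amplitude = amplitude_ratio_db(ratio)  # about -14.0 dB
```

Read as a power ratio, 1:5 gives about -7 dB, which falls inside the quoted range of -9...-6 dBA; read as an amplitude ratio it would give about -14 dB. This suggests, without confirming, that the quoted figure refers to powers.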
To study the properties of the model, the short phrase "Brought the ship onto the landing path" was used; to obtain a continuous signal, it is looped during playback. This also makes it possible to explore the capabilities of the protection system more thoroughly. The test phrase and the research format were selected in accordance with [16].

Experimental procedures and Results
The results of mathematical modeling of a speech information protection system based on a scrambler-type speech-like noise generator are presented as oscillograms of the main signals in the operating interval 5...15 s. This interval was chosen in order to study the system in the steady state (without focusing on the transient mode of system start-up).
An analysis of the results shows that:
1. The input test signal A(t) meets the requirements of simple continuous speech:
- the length of the test signal is approximately 2 s (Fig. 3, a);
- the interval between repetitions of the test signal is 0.2 s (Fig. 3, a);
- the test signal is repeated for 4 full periods (Fig. 3, a);
- the main linguistic parameters of phonemes (fundamental frequency F0 and main formants F1, F2 and F3) are clearly defined (Fig. 3, b) and lie in the range from 200 Hz to 1000 Hz;

- the frequency range that the signal occupies (from 20 Hz to 5600 Hz) is typical of phonograms processed on a personal computer, without the use of additional hardware and software devices.
2. No displacement of the acoustic signal in the channel SA(t) (Fig. 5, a) relative to the input signal A(t) (Fig. 3, a), as mentioned in [16], was observed; this is due to the way the channel model operates.
3. Analysis of the spectrograms of the output signal of the noise generator SG(t) (Fig. 4, b) and of the acoustic signal in the channel SA(t) (Fig. 5, b) shows:
- the basic linguistic parameters of phonemes cannot be distinguished;
- the frequency range of the signal has expanded to 10 kHz;
- the spectrum of the output signal of the noise generator SG(t) (Fig. 4, b) contains a zone of maxima in the frequency range 3...7 kHz; however, it does not correlate with the spectrum of the input test signal A(t) either in its set of frequency components or in its characteristic features;
- the spectrum of the acoustic signal in the channel SA(t) (Fig. 5, b) contains three groups of maxima: the main group (in the frequency range 3...7 kHz) and two edge groups (in the frequency ranges 0...1500 Hz and 8.5...10 kHz). The appearance of the edge groups is determined by the influence of the input test signal A(t) and the features of the implementation of the mathematical model in Matlab 15 R2015a/Simulink. Moreover, these groups of maxima do not correlate, either in their set of frequency components or in their characteristic features, with the spectrum of the input test signal A(t) or with each other.
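One common way to put a number on the statement that two spectra do not correlate is the Pearson coefficient between their magnitude values on a shared frequency grid. The sketch below shows the calculation only; the eight-band spectra are toy values invented for illustration and are not measurements from the model, and the source does not say which correlation measure was used.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a flat spectrum")
    return cov / math.sqrt(var_a * var_b)


# Toy magnitude spectra on 8 equal bands (illustrative values only):
# the "input" is concentrated in the low bands, the "output" in high bands.
input_spectrum = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
output_spectrum = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0]

r_same = pearson(input_spectrum, input_spectrum)
r_shifted = pearson(input_spectrum, output_spectrum)
```

A coefficient near 1 indicates matching spectral shapes, while values near zero or below indicate that the maxima of one spectrum do not coincide with those of the other.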

Discussion
The proposed method of generating an interference signal (noise) for active acoustic and vibration jamming systems is based on the principles of time and frequency scrambling. Although scramblers themselves were once quite common and widely used in telecommunication networks, their use as noise generators has not been recorded in the literature.

7. Conclusions
1. The paper proposes a generalized structural diagram of a system for protecting speech information from leakage by acoustic and vibration channels based on a scrambler-type speech-like interference signal generator. The system provides an integrated approach to the generation of an interference signal (noise): time and frequency permutations are used together with the "speech choir" method. To increase the resistance of the interference signal to reengineering methods, feedback has been introduced into the system.
2. A mathematical description of the system is proposed, which makes it possible to develop and study in Matlab 15 R2015a/Simulink a mathematical model of a system for protecting speech information from leakage by acoustic and vibration channels based on a scrambler-type speech-like interference signal generator. Studies have shown the high efficiency of the proposed method for generating interference (noise), which makes it possible to reduce the level of the interference signal by 6...9 dBA at the same level of residual speech intelligibility (W=0.1).