
Commit 9bdba95

committed
updates for arxiv submission
1 parent 649ef00 commit 9bdba95


paper.tex

Lines changed: 93 additions & 21 deletions
@@ -10,7 +10,7 @@
 %\usepackage{nips_2017}

 % to compile a camera-ready version, add the [final] option, e.g.:
-\usepackage{nips_2017}
+\usepackage[final,nonatbib]{nips_2017}

 \usepackage[utf8]{inputenc} % allow utf-8 input
 \usepackage[T1]{fontenc} % use 8-bit T1 fonts
@@ -89,15 +89,15 @@ \section{Introduction}
 % There have been a few investigations of audio style transfer employing magnitude representations, such as Ulyanov's original work and a follow-up work employing VGG \cite{Wyse2017}. These models discard the phase information in favor of phase reconstruction. As well, there have been further developments in neural networks capable of large scale audio classification such as \cite{Hershey2016}, though these are trained on magnitude representations and would also require phase reconstruction as part of a stylization process. Perhaps most closely aligned is the work of NSynth \cite{Engel2017}, whose work is capable of taking as input a raw audio signal and allows for applications such as the blending of musical notes in a neural embedding space. Though their work is capable of synthesizing raw audio from its embedding space, there is no separation of content and style, and thus they cannot be independently manipulated.

 % Speech synthesis techniques
-%TacoTron demonstrated a technique using ...
-%In a similar vein, WaveNet, ...
+%TacoTron demonstrated a technique using ...
+%In a similar vein, WaveNet, ...
 %NSynth incorporates a WaveNet decoder and includes an additional encoder, allowing one to encode a time domain audio signal using the encoding part of the network with 16 channels at 125x compression, and use these as biases during the WaveNet decoding. The embedding space is capable of linearly mixing instruments, though it has yet to be explored as a network for audio stylization where content and style are independently manipulated.

 % SampleRNN

 % Soundnet

-% VGG (Lonce Wyse, https://arxiv.org/pdf/1706.09559.pdf);
+% VGG (Lonce Wyse, https://arxiv.org/pdf/1706.09559.pdf);

 % Zdenek Pruska

@@ -106,11 +106,11 @@ \section{Introduction}
 % CycleGAN (https://gauthamzz.github.io/2017/09/23/AudioStyleTransfer/)


-\section{Experiments\footnote{Further details are described in the Supplementary Materials}}
+\section{Experiments}

 We explore a variety of computational graphs which use as their first operation a discrete Fourier transform in order to project an audio signal into its real and imaginary components. We then explore manipulations of these components, including directly applying convolutional layers, or first undergoing an additional transformation into the more typical magnitude and phase components, as well as combinations of each of these components. For representing phase, we also explored using the original phase, the phase differential, and the unwrapped phase differentials. From here, we apply the same techniques for stylization as described in \cite{Ulyanov2016}, except we no longer have to optimize a noisy magnitude input, and can instead optimize a time domain signal. We also explore combinations of using content/style layers following the initial projections and after fully connected layers.

-We also explore two pre-trained networks: a pre-trained WaveNet decoder, and the encoder portion of an NSynth network as provided by Magenta \cite{Engel2017}, and look at the activations of each of these networks at different layers, much like the original image style networks did with VGG. We also include Ulyanov's original network as a baseline, and report our results as seen through spectrograms and through listening. Our code is also available online\footnote{https://github.com/pkmital/neural-audio-style-transfer}.
+We also explore two pre-trained networks: a pre-trained WaveNet decoder, and the encoder portion of an NSynth network as provided by Magenta \cite{Engel2017}, and look at the activations of each of these networks at different layers, much like the original image style networks did with VGG. We also include Ulyanov's original network as a baseline, and report our results as seen through spectrograms and through listening. Our code is also available online\footnote{https://github.com/pkmital/neural-audio-style-transfer}\footnote{Further details are described in the Supplementary Materials}.

 \section{Results}

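To make the first operation in the graphs described above concrete, here is a minimal NumPy sketch of projecting framed audio onto a discrete Fourier basis with plain matrix multiplies, which is what allows the projection to sit inside an automatically differentiated graph so that the time-domain signal itself can be optimized. The function names, the example frame size of 2048 and hop size of 512, and the use of a full-frame DFT are illustrative assumptions, not the authors' code.

# Illustrative sketch (assumptions noted above), not the repository's implementation.
import numpy as np

def dft_basis(n_fft):
    # Real and imaginary parts of the DFT basis, keeping only the
    # non-redundant half of the symmetric spectrum.
    k = np.arange(n_fft // 2 + 1)[:, None]   # frequency bins
    n = np.arange(n_fft)[None, :]            # samples within a frame
    angle = 2.0 * np.pi * k * n / n_fft
    return np.cos(angle), -np.sin(angle)     # each of shape (bins, n_fft)

def frame_signal(x, frame_size=2048, hop=512):
    # Overlapping frames, no padding or centering.
    n_frames = 1 + (len(x) - frame_size) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_size)[None, :]
    return x[idx]                            # (n_frames, frame_size)

def project(x, frame_size=2048, hop=512):
    frames = frame_signal(x, frame_size, hop)
    cos_b, sin_b = dft_basis(frame_size)
    real = frames @ cos_b.T                  # (n_frames, bins)
    imag = frames @ sin_b.T
    mag = np.sqrt(real ** 2 + imag ** 2)
    phase = np.arctan2(imag, real)
    # The three phase representations mentioned above: the phase itself,
    # the frame-to-frame phase differential, and its unwrapped counterpart.
    phase_diff = np.diff(phase, axis=0)
    unwrapped_diff = np.diff(np.unwrap(phase, axis=0), axis=0)
    return real, imag, mag, phase, phase_diff, unwrapped_diff

Expressed this way (matrix products and element-wise operations), the same projection can be written directly in an autodiff framework, so gradients of any downstream content or style loss flow back to the raw audio samples.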
@@ -124,29 +124,101 @@ \section{Discussion and Conclusion}

 This work explores neural audio style transfer of a time domain audio signal. Of these networks, only two produced any meaningful results: the magnitude and unwrapped phase network, which produced distinctly noisier syntheses, and the real, imaginary, and magnitude network, which was capable of resembling both the content and style sources at a similar quality to Ulyanov's original approach, though with interesting differences. It was especially surprising that we were unable to stylize with NSynth's encoder or decoder, though this is perhaps due to the limited number of combinations of layers and possible activations we explored, and is worth exploring more in the future.

-% Style transfer, like deep dream and its predecessor works in visualizing gradient activations, through exploration have the potential to enable us to understand representations created by neural networks. Through synthesis, and exploring the representations at each level of a neural network, we can start to gain insights into what sorts of representations if any are created by a network. However, to date, very few explorations of audio networks for the purpose of dreaming or stylization have been done.
+% Style transfer, like deep dream and its predecessor works in visualizing gradient activations, through exploration have the potential to enable us to understand representations created by neural networks. Through synthesis, and exploring the representations at each level of a neural network, we can start to gain insights into what sorts of representations if any are created by a network. However, to date, very few explorations of audio networks for the purpose of dreaming or stylization have been done.

 %End to end learning, http://www.mirlab.org/conference_papers/International_Conference/ICASSP\%202014/papers/p7014-dieleman.pdf - spectrums still do better than raw audio.

 \small
-\bibliographystyle{IEEEtran}
-\bibliography{style-transfer}
-
-\section{Supplementary Material}
-
-\subsection{Input Data}
-
-Each of the shallow untrained networks we used takes as input a raw audio signal sampled at 22050 Hz with a frame size of 2048 samples, an alpha of 0.01, and uses 150 iterations of the Adam optimizer. We explored manipulations in sample rate including [44100, 22050, and 16000]. For frame size (DFT size was always set to half frame size with no padding or centering), we explored [1024, 2048, 4096, 8192], with hop sizes of [128, 256, 512]. The resulting projections from a discrete Fourier basis set were then sliced to half width to remove their symmetric projections.
-
-For the NSynth and WaveNet networks, we used the native sampling rate they were trained on, 16000 Hz. For the shallow untrained networks, we explored a combination of networks that varied in their initial input processing, depth, and the number of layers and information we used for content and stylization. We tested networks which incorporated the real, imaginary, magnitude, and phase information of an audio source signal's DFT, as computed with a computational graph capable of automatic differentiation. This enabled us to apply stylization by optimizing an input noise signal, while keeping the rest of the network untrained.
-
-\subsection{Network}
-
-Ulyanov's original stylization network uses depth-wise convolution as the first layer operating on the magnitudes. We employ the same technique here, except using combinations of the real, imaginary, magnitude, and phase information as input, stacked along the height dimension. For kernel sizes, we tried a variety of widths, including ${[4, 8, 16]}$, and for heights, depending on the number of components included, we tried ${[1, H]}$, where $H$ is the total number of components included in the model. For instance, for a model incorporating real and imaginary components, we set $H = 2$, and stacked the real and imaginary components in rows. For the number of layers, we tried ${[1, 2, 3]}$. And finally, for representing phase, we tried the original phase, the phase differential, and the unwrapped phase differentials. We used a stride of 1 and a ReLU activation for all convolutional layers, and followed the weight initialization used by Ulyanov's baseline audio stylization network. Finally, we explored alphas including [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001].
+% Generated by IEEEtran.bst, version: 1.14 (2015/08/26)
+\begin{thebibliography}{1}
+\providecommand{\url}[1]{#1}
+\csname url@samestyle\endcsname
+\providecommand{\newblock}{\relax}
+\providecommand{\bibinfo}[2]{#2}
+\providecommand{\BIBentrySTDinterwordspacing}{\spaceskip=0pt\relax}
+\providecommand{\BIBentryALTinterwordstretchfactor}{4}
+\providecommand{\BIBentryALTinterwordspacing}{\spaceskip=\fontdimen2\font plus
+  \BIBentryALTinterwordstretchfactor\fontdimen3\font minus
+  \fontdimen4\font\relax}
+\providecommand{\BIBforeignlanguage}[2]{{%
+\expandafter\ifx\csname l@#1\endcsname\relax
+\typeout{** WARNING: IEEEtran.bst: No hyphenation pattern has been}%
+\typeout{** loaded for the language `#1'. Using the pattern for}%
+\typeout{** the default language instead.}%
+\else
+\language=\csname l@#1\endcsname
+\fi
+#2}}
+\providecommand{\BIBdecl}{\relax}
+\BIBdecl
+
+\bibitem{Ulyanov2016}
+D.~Ulyanov and V.~Lebedev, ``{Audio texture synthesis and style transfer},''
+  2016.
+
+\bibitem{Gatys}
+L.~A. Gatys, A.~S. Ecker, and M.~Bethge, ``{A Neural Algorithm of Artistic
+  Style},'' \emph{arXiv}, 2015.
+
+\bibitem{Ulyanov2016b}
+\BIBentryALTinterwordspacing
+D.~Ulyanov, V.~Lebedev, A.~Vedaldi, and V.~Lempitsky, ``{Texture Networks:
+  Feed-forward Synthesis of Textures and Stylized Images},'' 2016. [Online].
+  Available: \url{http://arxiv.org/abs/1603.03417}
+\BIBentrySTDinterwordspacing
+
+\bibitem{Griffin1984}
+D.~W. Griffin and J.~S. Lim, ``{Signal Estimation from Modified Short-Time
+  Fourier Transform},'' \emph{IEEE Transactions on Acoustics, Speech, and
+  Signal Processing}, vol.~32, no.~2, pp. 236--243, 1984.
+
+\bibitem{Wyse2017}
+\BIBentryALTinterwordspacing
+L.~Wyse, ``{Audio Spectrogram Representations for Processing with Convolutional
+  Neural Networks},'' in \emph{Proceedings of the First International Workshop
+  on Deep Learning and Music joint with IJCNN}, vol.~1, no.~1, 2017, pp.
+  37--41. [Online]. Available: \url{http://arxiv.org/abs/1706.09559}
+\BIBentrySTDinterwordspacing
+
+\bibitem{Prusa2017}
+Z.~Prů{\v{s}}a and P.~Rajmic, ``{Toward High-Quality Real-Time Signal
+  Reconstruction from STFT Magnitude},'' \emph{IEEE Signal Processing Letters},
+  vol.~24, no.~6, pp. 892--896, 2017.
+
+\bibitem{Hershey2016}
+\BIBentryALTinterwordspacing
+S.~Hershey, S.~Chaudhuri, D.~P.~W. Ellis, J.~F. Gemmeke, A.~Jansen, R.~C.
+  Moore, M.~Plakal, D.~Platt, R.~A. Saurous, B.~Seybold, M.~Slaney, R.~J.
+  Weiss, and K.~Wilson, ``{CNN Architectures for Large-Scale Audio
+  Classification},'' \emph{International Conference on Acoustics, Speech and
+  Signal Processing (ICASSP)}, 2016. [Online]. Available:
+  \url{http://arxiv.org/abs/1609.09430}
+\BIBentrySTDinterwordspacing
+
+\bibitem{Engel2017}
+\BIBentryALTinterwordspacing
+J.~Engel, C.~Resnick, A.~Roberts, S.~Dieleman, D.~Eck, K.~Simonyan, and
+  M.~Norouzi, ``{Neural Audio Synthesis of Musical Notes with WaveNet
+  Autoencoders},'' in \emph{Proceedings of the 34th International Conference on
+  Machine Learning}, 2017. [Online]. Available:
+  \url{http://arxiv.org/abs/1704.01279}
+\BIBentrySTDinterwordspacing
+
+\bibitem{Oord2016b}
+\BIBentryALTinterwordspacing
+A.~van~den Oord, S.~Dieleman, H.~Zen, K.~Simonyan, O.~Vinyals, A.~Graves,
+  N.~Kalchbrenner, A.~Senior, and K.~Kavukcuoglu, ``{WaveNet: A Generative
+  Model for Raw Audio},'' \emph{arXiv}, pp. 1--15, 2016. [Online]. Available:
+  \url{http://arxiv.org/abs/1609.03499}
+\BIBentrySTDinterwordspacing
+
+\end{thebibliography}

 \begin{figure}
 \centering
 \includegraphics[width=1\linewidth]{synthesis}
 \caption{Example synthesis optimizing audio directly with both the source content and style audible.}
 \end{figure}
 \end{document}
+
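The removed Supplementary Material above describes the shallow untrained networks (a first convolution over the stacked real, imaginary, magnitude, and phase components, stride 1, ReLU, randomly initialized weights) but refers to Ulyanov's baseline approach for the stylization objective itself. Below is a hedged NumPy sketch of that objective using a single randomly initialized convolutional layer over stacked features; the 1024-filter width, the weight-scaling rule, and all names are illustrative assumptions rather than the repository's implementation, and the layer is written as an ordinary convolution over time rather than the depth-wise arrangement described above.

# Illustrative sketch (assumptions noted above): content and style terms in the
# spirit of Ulyanov's audio style transfer, from one shallow untrained layer.
import numpy as np

rng = np.random.default_rng(0)

def shallow_conv(features, n_filters=1024, kernel_width=8):
    # features: (n_frames, bins, n_components), e.g. real/imag/mag stacked
    # along the last dimension.  One convolution over time, stride 1, ReLU.
    n_frames, bins, comps = features.shape
    scale = np.sqrt(2.0 / (kernel_width * bins * comps + n_filters))  # assumed init
    w = rng.normal(0.0, scale, size=(kernel_width, bins, comps, n_filters))
    out = np.zeros((n_frames - kernel_width + 1, n_filters))
    for t in range(out.shape[0]):
        patch = features[t:t + kernel_width]       # (kernel_width, bins, comps)
        out[t] = np.tensordot(patch, w, axes=3)    # "valid" convolution
    return np.maximum(out, 0.0)                    # ReLU activations

def content_loss(a_synth, a_content):
    # Match activations of the synthesized signal to those of the content signal.
    return np.mean((a_synth - a_content) ** 2)

def style_loss(a_synth, a_style):
    # Match second-order statistics (Gram matrices) of the activations.
    def gram(a):
        return a.T @ a / a.shape[0]
    return np.mean((gram(a_synth) - gram(a_style)) ** 2)

In the experiments the same computation is expressed in a graph capable of automatic differentiation, so a weighted sum of these two terms can be minimized with Adam over the time-domain input signal itself (150 iterations at a learning rate of 0.01 in the default configuration quoted above), rather than over a noisy magnitude spectrogram.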
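The hyperparameter sweep spelled out in the removed Input Data and Network subsections amounts to a plain grid search; the sketch below only restates the values listed there, while the dictionary keys and the enumeration itself are illustrative.

# Search grid restated from the removed Supplementary Material; structure and
# key names are illustrative, the values come from the text above.
from itertools import product

grid = {
    "sample_rate": [44100, 22050, 16000],
    "frame_size": [1024, 2048, 4096, 8192],   # DFT size = frame_size / 2, no padding or centering
    "hop_size": [128, 256, 512],
    "kernel_width": [4, 8, 16],
    "n_layers": [1, 2, 3],
    "alpha": [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001],
    "phase": ["original", "differential", "unwrapped_differential"],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs), "candidate configurations")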
