I found this paper http://arxiv.org/pdf/1508.06576v2.pdf very clear and well-written. Implementations appeared quickly, such as https://github.com/kaishengtai/neuralart.
The major contribution of the paper seems to be the content/style decomposition of an image (as described in Equations 1, 5 and 7). The style representation (Equation 4) seems to be the key.
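The style representation is built on the Gram matrix of layer activations: the correlations between pairs of feature maps within a layer. A minimal NumPy sketch of that computation (my own naming, not the paper's code) might look like:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix underlying the paper's style representation.

    `features` stands in for one layer's ConvNet activations with shape
    (channels, height, width). Entry (i, j) of the result is the inner
    product of channel i's and channel j's flattened activation maps,
    so the matrix captures which feature channels fire together,
    discarding where in the image they fired.
    """
    c = features.shape[0]
    f = features.reshape(c, -1)  # flatten spatial dimensions
    return f @ f.T               # (channels, channels)
```

Because spatial positions are summed out, two images with very different layouts but similar textures can have similar Gram matrices, which is why this works as a "style" summary.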
It seems obvious that the "content" and "style" defined in this paper are different from what we know by common sense. The "content" is actually a mixture of both content and style, since it was defined as the activations at a certain layer depth in an object-recognition ConvNet, and there is no guarantee that those neurons won't also be activated by "style" features. Likewise, the "style" was defined as correlations among feature activation patterns, which, for the same reason, could encompass both content and style.
One imaginary (and possibly wrong) example of "style" in my mind would be "starry (style) --correlated with-- sky (content)". So when you minimize the "style" loss function using gradient descent, you are more likely to make areas starry wherever they look like sky. And how starry they become can be adjusted by the weighting between content and style.
I interpret the authors' main idea as: to transform an image from X (input) to Y (output), preserve the original features while introducing new features that are highly correlated in image Z (a famous painting), then balance the weights between original and new features. I know that in practice the authors started with a white-noise image and minimized its distances from both X's and Z's representations, but the ideas are equivalent.
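The white-noise procedure can be sketched end to end in a toy setting. This is only an illustration under a big simplifying assumption: the "network" here is the identity, so plain arrays stand in for ConvNet feature maps (flattened to shape `(channels, positions)`); all names and the weights are mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n = 3, 64                        # feature channels, spatial positions
content = rng.normal(size=(c, n))   # stand-in for content-layer activations of X
style = rng.normal(size=(c, n))     # stand-in for style-layer activations of Z

def gram(f):
    # channel-wise correlations of the feature maps (the style summary)
    return f @ f.T

def total_loss(x, alpha=1.0, beta=1.0):
    # weighted sum of a content term (squared distance to X's features)
    # and a style term (squared distance between Gram matrices, with
    # the paper's 1 / (4 N^2 M^2) normalization)
    l_content = 0.5 * np.sum((x - content) ** 2)
    l_style = np.sum((gram(x) - gram(style)) ** 2) / (4 * c**2 * n**2)
    return alpha * l_content + beta * l_style

alpha, beta, lr = 1.0, 1.0, 0.05
x = rng.normal(size=(c, n))         # start from white noise, as in the paper
loss0 = total_loss(x, alpha, beta)

for _ in range(200):
    g_content = x - content                                  # gradient of content term
    g_style = (gram(x) - gram(style)) @ x / (c**2 * n**2)    # gradient of style term
    x -= lr * (alpha * g_content + beta * g_style)           # gradient descent step
```

Adjusting `alpha`/`beta` is exactly the content-vs-style trade-off discussed above: a larger `beta` pulls the result toward Z's feature correlations at the expense of fidelity to X.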
Potential applications of this paper's idea could be interesting too: generating camouflage from images of an environment; adding or removing accents in human voices; translating scientific papers into novels; translating novels into sci-fi novels; etc.