Jekyll2018-04-22T04:07:11+00:00https://kijungyoon.github.io/KiJung YoonA blog about neural computation, machine learning, and AISonnet Installation Instructions2017-07-12T01:00:00+00:002017-07-12T01:00:00+00:00https://kijungyoon.github.io/sonnet-installation<h2 id="sonnet-on-aws-ubuntu-server-1604-lts-hvm-ssd-volume-type---ami-835b4efa">Sonnet on AWS Ubuntu Server 16.04 LTS (HVM), SSD Volume Type - ami-835b4efa</h2>
<blockquote>
<p>The main principle of Sonnet is to first construct Python objects which represent some part of a neural network, and then separately connect these objects into the TensorFlow computation graph. The objects are subclasses of <code>sonnet.AbstractModule</code> and as such are referred to as Modules. Modules may be connected into the graph multiple times, and any variables declared in that module will be automatically shared on subsequent connection calls.</p>
</blockquote>
<ol><li>
<p>We start by updating the packages of Ubuntu:</p>
<pre><code>
$ sudo apt-get update && sudo apt-get -y upgrade
</code></pre>
</li>
<li>
<p>Install pip and virtualenv by issuing one of the following commands:</p>
<pre><code>
$ sudo apt-get install python-pip python-dev python-virtualenv python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas # for Python 2.7
$ sudo apt-get install python3-pip python3-dev python-virtualenv python3-numpy python3-scipy python3-matplotlib ipython3 ipython3-notebook python3-pandas # for Python 3.n
</code></pre>
</li>
<li>
<p>Update pip:</p>
<pre><code>
$ pip install --upgrade pip
</code></pre>
</li>
<li>
<p>Create a virtualenv environment by issuing one of the following commands:</p>
<pre><code>
$ virtualenv --system-site-packages ~/tensorflow # for Python 2.7
$ virtualenv --system-site-packages -p python3 ~/tensorflow # for Python 3.n
</code></pre>
<p>where <code>~/tensorflow</code> is a target directory that specifies the top of the virtualenv tree. We may choose any directory.</p>
</li>
<li>
<p>Activate the virtualenv environment by issuing one of the following commands:</p>
<pre><code>
$ source ~/tensorflow/bin/activate # bash, sh, ksh, or zsh
$ source ~/tensorflow/bin/activate.csh # csh or tcsh
</code></pre>
<p>The preceding <code>source</code> command should change your prompt to the following:</p>
<pre><code>
(tensorflow)$
</code></pre>
</li>
<li>
<p>Issue one of the following commands to install TensorFlow in the active virtualenv environment:</p>
<pre><code>
(tensorflow)$ pip install --upgrade tensorflow # for Python 2.7
(tensorflow)$ pip3 install --upgrade tensorflow # for Python 3.n
(tensorflow)$ pip install --upgrade tensorflow-gpu # for Python 2.7 and GPU
(tensorflow)$ pip3 install --upgrade tensorflow-gpu # for Python 3.n and GPU
</code></pre>
</li>
<li>
<p>Install JDK 8 by using:</p>
<pre><code>
$ sudo apt-get install openjdk-8-jdk
</code></pre>
</li>
<li>
<p>Add Bazel distribution URI as a package source (one time setup):</p>
<pre><code>
$ echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
$ curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
</code></pre>
</li>
<li>
<p>Install and update Bazel:</p>
<pre><code>
$ sudo apt-get update && sudo apt-get install bazel
</code></pre>
<p>Once installed, you can upgrade to a newer version of Bazel with:</p>
<pre><code>
$ sudo apt-get upgrade bazel
</code></pre>
</li>
<li>
<p>Activate virtualenv again:</p>
<pre><code>
$ source ~/tensorflow/bin/activate # bash, sh, ksh, or zsh
$ source ~/tensorflow/bin/activate.csh # csh or tcsh
</code></pre>
</li>
<li>
<p>First clone the Sonnet source code with TensorFlow as a submodule:</p>
<pre><code>
$ git clone --recursive https://github.com/deepmind/sonnet
</code></pre>
<p>and then configure the TensorFlow headers:</p>
<pre><code>
$ cd sonnet/tensorflow
$ ./configure
$ cd ../
</code></pre>
<p>You can choose the suggested defaults during the TensorFlow configuration. Note: This will not modify your existing installation of TensorFlow. This step is necessary so that Sonnet can build against the TensorFlow headers.</p>
</li>
<li>
<p>Build and run the installer:</p>
<pre><code>
$ mkdir /tmp/sonnet
$ bazel build --config=opt --copt="-D_GLIBCXX_USE_CXX11_ABI=0" :install
$ ./bazel-bin/install /tmp/sonnet
</code></pre>
<p><code>pip install</code> the generated wheel file:</p>
<pre><code>
$ pip install /tmp/sonnet/*.whl
</code></pre>
</li>
<li>
<p>You can verify that Sonnet has been successfully installed by, for example, trying out the resampler op:</p>
<pre><code>
$ cd ~/
$ python
>>> import sonnet as snt
>>> import tensorflow as tf
>>> snt.resampler(tf.constant([0.]), tf.constant([0.]))
</code></pre>
<p>The expected output should be:</p>
<pre><code>
<tf.Tensor 'resampler/Resampler:0' shape=(1,) dtype=float32>
</code></pre>
</li>
</ol>
<p></p>
<p>[1] <a href="https://www.tensorflow.org/install/install_linux#installing_with_virtualenv" target="_blank">https://www.tensorflow.org/install/install_linux#installing_with_virtualenv</a></p>
<p>[2] <a href="https://docs.bazel.build/versions/master/install-ubuntu.html" target="_blank">https://docs.bazel.build/versions/master/install-ubuntu.html</a></p>
<p>[3] <a href="https://github.com/deepmind/sonnet" target="_blank">https://github.com/deepmind/sonnet</a></p>Differentiable Neural Computer2016-11-28T20:30:00+00:002016-11-28T20:30:00+00:00https://kijungyoon.github.io/DNC<p>Much as a feedforward neural network disentangles the spatial features of an input such as an image by approximating some function, a recurrent neural network (RNN) aims to learn disentangled representations of temporal features through feedback connections and parameter sharing over time. The vanilla RNN suffers from the vanishing gradient problem, and gated RNNs have emerged as an alternative and are among the most effective sequence models. The main idea is to create paths through time that prevent backpropagated errors from vanishing or exploding. A special variant, the long short-term memory (LSTM) network, has been found useful in many tasks such as machine translation, voice recognition, and image captioning.</p>
<p>Despite their popularity, RNNs lack the ability to represent variables and data structures and to store data over long timescales, both of which are needed for more complex tasks such as reasoning and inference problems. To overcome these limitations, Graves et al. <a href="http://www.nature.com/nature/journal/v538/n7626/full/nature20101.html">[1]</a> introduced the differentiable neural computer (DNC), an extension of RNNs with an external memory bank that the network can read from and write to. It is broadly regarded as an RNN augmented with attention mechanisms <a href="http://distill.pub/2016/augmented-rnns/#adaptive-computation-time">[2]</a>, a class of models that may considerably extend the power of learning machines.</p>
<p>In this post, I focus on presenting detailed model schematics rather than explaining every component, because I have yet to find a diagram that visualizes the whole circuit, and there are already many instructive blog posts outlining the key ingredients of the model. I personally use these schematics as a cheat sheet for neural network models.</p>
<p>The predecessor of the DNC is the neural Turing machine (NTM) <a href="https://arxiv.org/abs/1410.5401">[3]</a>, developed by the same authors, and the two share a large part of their architecture. An elementary module of the NTM/DNC is an LSTM, referred to as a controller because it determines the values of the interface parameters that control the memory interactions. Unfolding the recurrent computation, here is the vanilla LSTM:</p>
<p><img src="https://kijungyoon.github.io/assets/images/lstm.png" alt="Markdown Image" /></p>
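<p>As a reference point for the schematic above, one step of a standard LSTM can be sketched in NumPy. This is only a sketch: the gate ordering, the stacked weight matrices <code>W</code> and <code>U</code>, and all names and sizes below are assumptions made for illustration and may not match the figure's notation exactly.</p>

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of a vanilla LSTM (hypothetical helper).

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    stacked in the order input gate, forget gate, output gate, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell update
    c_t = f * c_prev + i * g       # additive path that carries the cell state through time
    h_t = o * np.tanh(c_t)         # hidden state exposed to the rest of the network
    return h_t, c_t


# Unfold the recurrence over a short random input sequence.
rng = np.random.default_rng(0)
D, H, T = 3, 4, 5                  # input size, hidden size, sequence length (arbitrary)
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(T, D)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h)
```

<p>The weights <code>W</code>, <code>U</code>, and <code>b</code> play the role of the learned input and recurrent weights in the schematic; the same weights are reused at every time step.</p>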
<p>Note that the yellow box represents the set of (input/recurrent) weights to be learned for successfully performing a task given a time-varying input <script type="math/tex">\mathbf{x}_t</script>. Likewise, the controller of NTM is modelled by LSTM with additional sets of weights that determine interface parameters, whereby read-write access to a memory is regulated:</p>
<p><img src="https://kijungyoon.github.io/assets/images/ntm.png" alt="Markdown Image" /></p>
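<p>One memory interaction shared by the NTM and the DNC is content-based reading: the controller emits a key and a strength, and the read weights are a softmax over the cosine similarities between the key and each memory row. The following is a minimal NumPy sketch of that single step; the function names, the toy memory, and the value of the strength <code>beta</code> are hypothetical and chosen only for illustration.</p>

```python
import numpy as np


def content_weights(memory, key, beta):
    """Content-based addressing: softmax over scaled cosine similarities.

    memory: (N, W) matrix with N slots, key: (W,) lookup key, beta: strength >= 0.
    """
    eps = 1e-8  # guards against division by zero for all-zero rows
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    similarity = (memory @ key) / norms
    logits = beta * similarity
    logits = logits - logits.max()  # numerically stable softmax
    weights = np.exp(logits)
    return weights / weights.sum()


def read(memory, weights):
    """A read vector is the weighted sum of the memory rows."""
    return weights @ memory


memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
key = np.array([0.0, 0.0, 1.0])
w = content_weights(memory, key, beta=10.0)
r = read(memory, w)
print(w, r)
```

<p>Because every operation is differentiable, gradients can flow from the read vector back to the key and the strength, which is what lets the controller learn where to look.</p>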
<p>The first architectural difference between the DNC and the NTM is that the DNC employs multiple read vectors and a deep LSTM, although neither is strictly required. The qualitative difference is that the DNC uses dynamic memory allocation to ensure that blocks of allocated memory do not overlap and interfere. Unlike the NTM, this new addressing mechanism allows memory to be reused when processing long sequences.</p>
<p><img src="https://kijungyoon.github.io/assets/images/dnc.png" alt="Markdown Image" /></p>
<p>Note that there are some notational mismatches between NTM and DNC, but they are consistent with the original papers.<br />
(still working…)</p>
<p>[1] A. Graves, G. Wayne et al.<br />
<a href="https://www.nature.com/nature/journal/v538/n7626/full/nature20101.html" target="_blank">Hybrid computing using a neural network with dynamic external memory.</a><br />
Nature, 2016<br />
[2] C. Olah and S. Carter <br />
<a href="http://distill.pub/2016/augmented-rnns/#adaptive-computation-time" target="_blank">Attention and Augmented Recurrent Neural Networks</a><br />
Distill, 2016<br />
[3] A. Graves, G. Wayne, and I. Danihelka<br />
<a href="https://arxiv.org/abs/1410.5401" target="_blank">Neural Turing Machines</a><br />
ArXiv, 2014<br /></p>KiJung YoonScience in the age of selfies2016-08-31T12:00:00+00:002016-08-31T12:00:00+00:00https://kijungyoon.github.io/science-in-the-age-of-selfies<p>Would a time traveler arriving today from half a century ago be as dazzled by the scientific theories and engineering technologies invented in the meantime as one who arrived 50 years ago from a century ago? Geman D. & Geman S. <a href="http://www.pnas.org/content/113/34/9384.full.pdf">[1]</a> answered no, asserting that, in contrast to the preceding 50 years, when groundbreaking discoveries and inventions proliferated across all domains, the advances of the past 50 years were mostly incremental and largely focused on newer and faster ways to gather and store information, communicate, or be entertained.</p>
<p>One of the major causes the authors point to is that we spend too much time “being online”, where intense and sustained concentration is inhibited by perpetually distracting messaging and massive communication via the Internet.</p>
<blockquote>
<p style="text-align:center;"><strong><em>Perhaps “thinking out of the box” has become rare </em></strong><strong><em>because the Internet is itself a box.</em></strong></p>
<p> </p>
</blockquote>
<p>Taking Grigori Perelman’s proof of the Poincaré conjecture and Yitang Zhang’s contributions to the twin-prime conjecture as extreme examples of accomplishments attributed to an instinct for solitude, I totally agree on the necessity of being disconnected and unplugged for one’s own benefit, let alone for science. We might have to rely on yet another new technology that lets us set aside such time without conscious effort.</p>
<p>[1] Donald Geman and Stuart Geman<br />
<a href="http://www.pnas.org/content/113/34/9384.full.pdf" target="_blank">Opinion: Science in the age of selfies</a><br />
PNAS 2016 113 (34) 9384-9387; doi:10.1073/pnas.1609793113</p>KiJung YoonStructure Estimation for Discrete Graphical Models2016-08-13T12:00:00+00:002016-08-13T12:00:00+00:00https://kijungyoon.github.io/structure-estimation-for-discrete-graphical-models<p>Graphical models for high-dimensional data are used in many applications such as computer vision, natural language processing, bioinformatics, and social networks. Learning the edge structure of an underlying graphical model from the data is of great interest because the model may be used to represent close relationships between people in a social network or interactions between (ensembles of) neurons in the brain.</p>
<p>It is well known that the inverse covariance matrix <script type="math/tex">(\Gamma=\Sigma^{-1})</script> of any multivariate Gaussian is graph-structured as a consequence of the Hammersley–Clifford theorem; zeros in <script type="math/tex">\Gamma</script> indicate the absence of an edge in the corresponding graphical model. For non-Gaussian distributions, however, it is unclear in general whether the entries of <script type="math/tex">\Gamma</script> have any relationship with the strengths of correlations along edges in the graph. Loh and Wainwright <a href="https://arxiv.org/abs/1212.0478">[1]</a> studied the analog of this type of correspondence for non-Gaussian graphical models, and we will touch on the main theorem using a binary Ising model, which is a special case of the non-Gaussian setting.</p>
<p><img src="https://kijungyoon.github.io/assets/images/graphical_models.jpg" alt="Markdown Image" /></p>
<p>The figure above shows a simple graph on four nodes (row 1) and the corresponding <script type="math/tex">\Gamma</script> (row 2). Notably, the edge structure of the chain graph (a) is uncovered in</p>
<p align="center">$$\Gamma_a = \Sigma_a^{-1} = \textrm{cov}(X_1,X_2,X_3,X_4)^{-1}$$</p>
<p>where the entries for the pairs (1,3), (1,4), and (2,4) are shown in blue, indicating (nearly) zero values in <script type="math/tex">\Gamma</script> and hence no edges in the graph. Note that row 2 actually displays <script type="math/tex">\log(\vert \Gamma \vert)</script> to highlight the difference between zero and nonzero values in <script type="math/tex">\Gamma</script>: entries that are zero in theory come out as tiny nonzero numbers under finite-precision computation, so strongly negative values in the log domain (bluish and greenish color codes) indicate the absence of an edge. The inverse covariance matrix <script type="math/tex">\Gamma_b</script> of the cycle (b), however, does not reveal its edge structure in the same way as the chain.</p>
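<p>The chain-versus-cycle contrast can be reproduced by brute force, since a four-node binary model has only 16 states. The sketch below enumerates them, computes the exact covariance, and inverts it. The 0/1 encoding and the parameter values (<code>node_param=0.1</code>, <code>edge_param=2.0</code>) are assumptions chosen for illustration, not values stated in this post; note also that a perfectly symmetric cycle with no node term is a degenerate case in which the non-edge entries happen to vanish, so a nonzero node parameter is used.</p>

```python
import itertools
import numpy as np


def ising_inverse_covariance(n_nodes, edges, node_param=0.1, edge_param=2.0):
    """Exact inverse covariance of a binary (0/1) Ising model via state enumeration."""
    states = np.array(list(itertools.product([0, 1], repeat=n_nodes)), dtype=float)
    # Unnormalized log-probability: node terms plus pairwise terms on the edges.
    log_weight = node_param * states.sum(axis=1)
    for s, t in edges:
        log_weight += edge_param * states[:, s] * states[:, t]
    prob = np.exp(log_weight)
    prob /= prob.sum()
    mean = prob @ states
    centered = states - mean
    cov = (centered * prob[:, None]).T @ centered
    return np.linalg.inv(cov)


chain = [(0, 1), (1, 2), (2, 3)]   # nodes 1-2-3-4, written with 0-based indices
cycle = chain + [(3, 0)]           # close the loop between nodes 4 and 1

gamma_chain = ising_inverse_covariance(4, chain)
gamma_cycle = ising_inverse_covariance(4, cycle)
print(np.round(gamma_chain, 2))
print(np.round(gamma_cycle, 2))
```

<p>With these settings, the chain's entries for the non-adjacent pairs vanish up to floating-point error, whereas the cycle's entries for the pairs (1,3) and (2,4) do not.</p>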
<p>When do certain inverse covariances capture the structure of a graphical model? Here comes the main theorem:</p>
<blockquote>
<p><strong>THEOREM 1</strong> (Triangulation and block graph-structure) <em>Consider an arbitrary discrete graphical model of a general multinomial exponential family, and let</em> <script type="math/tex">\tilde{\mathcal{C}}</script> <em>be the set of all cliques in any triangulation of the graph</em> <script type="math/tex">G</script>. <em>Then the generalized covariance matrix</em> cov<script type="math/tex">(\Psi(X;\tilde{\mathcal{C}}))</script> <em>is invertible, and its inverse</em> <script type="math/tex">\Gamma</script> <em>is block graph-structured:</em></p>
</blockquote>
<blockquote>
<p>(a) <em>For any subset</em> <script type="math/tex">A,B \in \tilde{\mathcal{C}}</script> <em>that are not subsets of the same maximal clique, the block</em> <script type="math/tex">\Gamma(A,B)</script> <em>is identically zero.</em><br />
(b) <em>For almost all parameters</em> <script type="math/tex">\theta</script>, <em>the entire block</em> <script type="math/tex">\Gamma(A,B)</script> <em>is nonzero whenever</em> <script type="math/tex">A</script> <em>and</em> <script type="math/tex">B</script> <em>belong to a common maximal clique.</em></p>
</blockquote>
<p>The essence of the theorem lies in augmenting the original set of random variables with higher-order moments, chosen by the rule stated above, before computing <script type="math/tex">\Sigma</script>. Graphs (c) and (d) are examples of triangulations of the cycle in (b), and the set of all cliques <script type="math/tex">\tilde{\mathcal{C}}</script> of the triangulated graph (c,e) yields the following sufficient statistics:</p>
<p align="center">$$\Psi(X) = \{ X_1,X_2,X_3,X_4,X_1X_2,X_2X_3,X_3X_4,X_1X_4,X_1X_3,X_1X_2X_3,X_1X_3X_4\}$$</p>
<p>Does the 11 <script type="math/tex">\times</script> 11 inverse of the matrix <script type="math/tex">\textrm{cov}(\Psi(X))</script> capture the graph structure? Yes: greenish colors correspond to zeros in the inverse covariance, some of which are <script type="math/tex">(X_2,X_4),(X_2,X_3X_4), (X_2,X_1X_4)</script>. Furthermore, one of the corollaries says that a much smaller set of cliques may be sufficient to obtain the conclusion of Theorem 1. Graphs (c) and (d) are concrete examples of the corollary:</p>
<p align="center">$$\Psi(X) = \{ X_1,X_2,X_3,X_4,X_1X_3\}$$</p>
<p align="center">$$\Psi(X) = \{ X_1,X_2,X_3,X_4,X_2X_4\}$$</p>
<p>The two augmented vectors unveil the absence of the edges (2,4) and (1,3), respectively: once the product over the added chord is included, the pair of nodes separated by that chord shows a zero in the inverse.</p>
<p>This is kind of cool because the authors have taken a leap toward making a general connection between graph structure and inverse covariance. Many open questions remain, such as how to deal with missing data or discrete graphical models with hidden variables.</p>
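<p>The effect of the first augmented vector can be checked numerically. The sketch below builds the four-node cycle as a binary (0/1) Ising model, appends the product <code>X1*X3</code> to the node variables, and compares the (X2, X4) entry of the inverse covariance before and after augmentation. The parameter values are illustrative assumptions, and <code>inverse_covariance</code> is a hypothetical helper.</p>

```python
import itertools
import numpy as np

# Binary (0/1) Ising model on the cycle 1-2-3-4-1 (0-based indices below).
# theta_node and theta_edge are illustrative values, not taken from the post.
theta_node, theta_edge = 0.1, 2.0
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

states = np.array(list(itertools.product([0, 1], repeat=4)), dtype=float)
log_weight = theta_node * states.sum(axis=1)
for s, t in cycle_edges:
    log_weight += theta_edge * states[:, s] * states[:, t]
prob = np.exp(log_weight)
prob /= prob.sum()


def inverse_covariance(features, prob):
    """Inverse of the exact covariance of the feature columns under prob."""
    mean = prob @ features
    centered = features - mean
    cov = (centered * prob[:, None]).T @ centered
    return np.linalg.inv(cov)


# Psi(X) = {X1, X2, X3, X4, X1*X3}: node variables plus the chord product.
psi = np.column_stack([states, states[:, 0] * states[:, 2]])

gamma_plain = inverse_covariance(states, prob)
gamma_aug = inverse_covariance(psi, prob)
print("plain     Gamma(X2, X4):", gamma_plain[1, 3])
print("augmented Gamma(X2, X4):", gamma_aug[1, 3])
```

<p>The intuition is that X2 and X4 are conditionally independent given (X1, X3), and any function of two binary variables is linear in X1, X3, and X1*X3, so including the product lets the linear inverse covariance see that independence.</p>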
<p>[1] P. Loh and M.J. Wainwright.<br />
<a href="https://arxiv.org/abs/1212.0478" target="_blank">Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses.</a><br />
Annals of Statistics, 41(6):3022–3049, 2013</p>KiJung YoonBlog Notebooks2016-08-05T22:10:00+00:002016-08-05T22:10:00+00:00https://kijungyoon.github.io/blog-notebooks<h2 id="python-code-snippets">Python Code Snippets</h2>
<ul>
<li><a href="https://nbviewer.jupyter.org/github/kijungyoon/blog-notebooks/blob/master/demixed_principal_component_analysis.ipynb">Demixed Principal Component Analysis (dPCA)</a></li>
</ul>Demixed Principal Component Analysis2016-08-05T12:00:00+00:002016-08-05T12:00:00+00:00https://kijungyoon.github.io/dPCA<p>We often encounter brain areas where neurons are modulated by more than one variable; e.g., neurons in the prefrontal cortex (PFC) respond to both the stimuli and the decisions of a subject simultaneously in various working memory tasks. Such neurons are said to have a mixed selectivity/representation, which makes it difficult to understand what information those neurons encode and how it is represented.</p>
<p>We want to unravel the neural specificity for each task parameter from a large population of neurons. The problem is reminiscent of standard dimensionality reduction methods such as principal component analysis (PCA) or linear discriminant analysis (LDA), because they share the objective of extracting a few key low-dimensional components that explain the essence of high-dimensional data. Unfortunately, vanilla PCA has no mechanism for actively separating the task parameters, so the mixed selectivity remains in all principal components, whereas LDA typically aims to maximize the separation between groups of data with regard to only one parameter. Kobak et al. <a href="https://elifesciences.org/content/5/e10989">[1]</a> introduced an interesting dimensionality reduction technique, demixed principal component analysis (dPCA), to overcome the limitations of existing methods and achieve the aforementioned goal.</p>
<p>In a nutshell, the first major step of dPCA is to decompose the data <script type="math/tex">\mathbf{X}</script> into uncorrelated conditional terms <script type="math/tex">\mathbf{X}_{\phi}</script> through marginalization procedures: <script type="math/tex">\mathbf{X} = \sum_{\phi}\mathbf{X}_{\phi} + \mathbf{X}_{noise}</script>, where <script type="math/tex">\mathbf{X}_{\phi}</script> is regarded as the portion dependent on the parameter <script type="math/tex">\phi</script> only (with appropriate mean subtractions followed by averaging over <script type="math/tex">\backslash \phi</script>). The next step is to perform a PCA-like dimensionality reduction that demixes the inherently entangled selectivity of neurons, with the objective of minimizing the Frobenius norm of the difference between <script type="math/tex">\mathbf{X}_{\phi}</script> and <script type="math/tex">\hat{\mathbf{X}}_{\phi}</script>.</p>
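<p>The marginalization step can be sketched for a toy case with two task parameters, stimulus and time. The array below stands in for trial-averaged firing rates, so there is no noise term; the sizes, the random data, and the variable names are assumptions made purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, S, T = 5, 3, 20                 # neurons, stimuli, time points (hypothetical sizes)
X = rng.normal(size=(N, S, T))     # stand-in for trial-averaged firing rates

# Remove each neuron's overall mean firing rate.
X_centered = X - X.mean(axis=(1, 2), keepdims=True)

# Time term: average over stimuli, leaving the part that depends on time only.
X_t = X_centered.mean(axis=1, keepdims=True)      # shape (N, 1, T)
# Stimulus term: average over time, leaving the part that depends on stimulus only.
X_s = X_centered.mean(axis=2, keepdims=True)      # shape (N, S, 1)
# Interaction term: whatever the two marginal terms do not explain.
X_st = X_centered - X_t - X_s                     # shape (N, S, T)

# The terms add back up to the centered data and are mutually orthogonal.
reconstruction = X_t + X_s + X_st
print(np.allclose(reconstruction, X_centered))
print(np.sum(X_t * X_s), np.sum(X_t * X_st), np.sum(X_s * X_st))
```

<p>Because the terms are orthogonal, the total variance splits into a sum of per-term variances, which is what makes it meaningful to ask how much variance each task parameter explains.</p>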
<p>The core of dPCA lies in designing the linear decoder <script type="math/tex">\mathbf{D}_{\phi}</script> and encoder <script type="math/tex">\mathbf{F}_{\phi}</script> that match <script type="math/tex">\hat{\mathbf{X}}_{\phi} (=\mathbf{F}_{\phi}\mathbf{D}_{\phi}\mathbf{X})</script> to <script type="math/tex">\mathbf{X}_{\phi}</script> as closely as possible. Here, <script type="math/tex">\mathbf{D}_{\phi}</script> and <script type="math/tex">\mathbf{F}_{\phi}</script> are estimated analytically by solving a linear regression problem with a rank constraint on the regression coefficient matrix, which is known as the reduced-rank regression problem. I said “PCA-like” because the encoder and decoder are tied together in PCA (<script type="math/tex">\mathbf{F} = \mathbf{D}^{\top}</script>) but not in dPCA. Nevertheless, PCA is used to meet the rank constraint and thereby determine the final <script type="math/tex">\mathbf{D}_{\phi}</script> and <script type="math/tex">\mathbf{F}_{\phi}</script>.</p>
<p>The authors provide extensive approaches to deal with practical issues such as overfitting and unbalanced/missing data, and they also discuss the current limitations of dPCA, all of which I strongly encourage you to read.</p>
<p>I posted a <a href="https://nbviewer.jupyter.org/github/kijungyoon/blog-notebooks/blob/master/demixed_principal_component_analysis.ipynb">Jupyter notebook on dPCA</a> to revisit a toy example in the paper and visualize what dPCA actually performs. I hope it helps you get to know dPCA better.</p>
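<p>The reduced-rank regression step admits a short closed-form sketch: fit an unconstrained least-squares map from the data to the marginalized term, then project that fit onto its leading principal axes. The code below omits the regularization discussed in the paper; the function name, the synthetic data, and the chosen rank <code>q</code> are assumptions for illustration.</p>

```python
import numpy as np


def reduced_rank_regression(X, X_phi, q):
    """Hypothetical helper: rank-q minimizer of ||X_phi - F D X||_F.

    X, X_phi: (N, K) matrices (neurons by data points), q: number of components.
    Returns the encoder F (N, q) and the decoder D (q, N).
    """
    # Unconstrained least-squares solution A such that A @ X approximates X_phi.
    A_ols = X_phi @ np.linalg.pinv(X)
    # Leading principal axes of the fitted values A_ols @ X.
    U, _, _ = np.linalg.svd(A_ols @ X, full_matrices=False)
    F = U[:, :q]            # encoder: orthonormal columns
    D = F.T @ A_ols         # decoder: generally not equal to F.T
    return F, D


rng = np.random.default_rng(0)
N, K, q = 6, 200, 2
X = rng.normal(size=(N, K))
# Synthetic target: a rank-2 linear function of X plus a little noise.
A_true = rng.normal(size=(N, 2)) @ rng.normal(size=(2, N))
X_phi = A_true @ X + 0.1 * rng.normal(size=(N, K))

F, D = reduced_rank_regression(X, X_phi, q)
err_rrr = np.linalg.norm(X_phi - F @ D @ X)
err_ols = np.linalg.norm(X_phi - X_phi @ np.linalg.pinv(X) @ X)
print(err_rrr, err_ols)
```

<p>The rank-constrained error can never fall below the unconstrained one; here the two are close because the synthetic target was generated with rank 2.</p>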
<p>[1] D Kobak+, W Brendel+, C Constantinidis, CE Feierstein, A Kepecs, ZF Mainen, X-L Qi, R Romo, N Uchida, CK Machens<br />
<a href="https://elifesciences.org/content/5/e10989" target="_blank">Demixed principal component analysis of neural population data</a><br />
eLife 2016</p>KiJung Yoon