The main principle of Sonnet is to first construct Python objects which represent some part of a neural network, and then separately connect these objects into the TensorFlow computation graph. The objects are subclasses of sonnet.AbstractModule and as such are referred to as Modules. Modules may be connected into the graph multiple times, and any variables declared in that module will be automatically shared on subsequent connection calls.
We start by updating the packages of Ubuntu:
$ sudo apt-get update && sudo apt-get -y upgrade
Install pip and virtualenv by issuing one of the following commands:
$ sudo apt-get install python-pip python-dev python-virtualenv python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas # for Python 2.7
$ sudo apt-get install python3-pip python3-dev python-virtualenv python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas # for Python 3.n
Update pip:
$ pip install --upgrade pip
Create a virtualenv environment by issuing one of the following commands:
$ virtualenv --system-site-packages ~/tensorflow # for Python 2.7
$ virtualenv --system-site-packages -p python3 ~/tensorflow # for Python 3.n
where ~/tensorflow is a target directory that specifies the top of the virtualenv tree. We may choose any directory.
Activate the virtualenv environment by issuing one of the following commands:
$ source ~/tensorflow/bin/activate # bash, sh, ksh, or zsh
$ source ~/tensorflow/bin/activate.csh # csh or tcsh
The preceding source command should change your prompt to the following:
(tensorflow)$
Issue one of the following commands to install TensorFlow in the active virtualenv environment:
(tensorflow)$ pip install --upgrade tensorflow # for Python 2.7
(tensorflow)$ pip3 install --upgrade tensorflow # for Python 3.n
(tensorflow)$ pip install --upgrade tensorflow-gpu # for Python 2.7 and GPU
(tensorflow)$ pip3 install --upgrade tensorflow-gpu # for Python 3.n and GPU
Install JDK 8 by using:
$ sudo apt-get install openjdk-8-jdk
Add the Bazel distribution URI as a package source (one-time setup):
$ echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
$ curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
Install and update Bazel:
$ sudo apt-get update && sudo apt-get install bazel
Once installed, you can upgrade to a newer version of Bazel with:
$ sudo apt-get upgrade bazel
Activate virtualenv again:
$ source ~/tensorflow/bin/activate # bash, sh, ksh, or zsh
$ source ~/tensorflow/bin/activate.csh # csh or tcsh
First clone the Sonnet source code with TensorFlow as a submodule:
$ git clone --recursive https://github.com/deepmind/sonnet
and then configure the TensorFlow headers:
$ cd sonnet/tensorflow
$ ./configure
$ cd ../
You can choose the suggested defaults during the TensorFlow configuration. Note: This will not modify your existing installation of TensorFlow. This step is necessary so that Sonnet can build against the TensorFlow headers.
Build and run the installer:
$ mkdir /tmp/sonnet
$ bazel build --config=opt --copt="-D_GLIBCXX_USE_CXX11_ABI=0" :install
$ ./bazel-bin/install /tmp/sonnet
pip install the generated wheel file:
$ pip install /tmp/sonnet/*.whl
You can verify that Sonnet has been successfully installed by, for example, trying out the resampler op:
$ cd ~/
$ python
>>> import sonnet as snt
>>> import tensorflow as tf
>>> snt.resampler(tf.constant([0.]), tf.constant([0.]))
The expected output should be:
<tf.Tensor 'resampler/Resampler:0' shape=(1,) dtype=float32>
[1] https://www.tensorflow.org/install/install_linux#installing_with_virtualenv
[2] https://docs.bazel.build/versions/master/install-ubuntu.html
Despite their popularity, RNNs lack the ability to represent variables and data structures, and to store data over long timescales, which limits them on more complex tasks involving reasoning and inference. To overcome these limitations, Graves et al. [1] introduced the differentiable neural computer (DNC), an extension of RNNs with an external memory bank that the network can read from and write to. It is broadly viewed as an RNN augmented with attention mechanisms [2], a combination that may well extend the power of learning machines.
In this post, I focus on presenting detailed model schematics rather than explaining every component, because I have yet to find a visualization of the whole circuit, and there are already many instructive blog posts outlining the key ingredients of the model framework. I personally use this post as a cheat sheet for neural network models.
The earlier form of the DNC is the neural Turing machine (NTM) [3], developed by the same authors, and the two share a large part of their architecture. An elementary module of the NTM/DNC is an LSTM, referred to as the controller because it determines the values of the interface parameters that regulate read-write access to the memory. Unfolding the recurrent computation, here is the vanilla LSTM:
Note that the yellow box represents the set of (input/recurrent) weights to be learned for successfully performing a task given a time-varying input \(\mathbf{x}_t\). Likewise, the controller of NTM is modelled by LSTM with additional sets of weights that determine interface parameters, whereby read-write access to a memory is regulated:
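As a companion to the diagram, here is a minimal NumPy sketch of one step of the vanilla LSTM; the gate ordering, variable names, and random initialization are my own choices, and the weights `W`, `U`, `b` play the role of the yellow box:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One vanilla LSTM step; W (4H x X), U (4H x H), b (4H,) are the
    learned input/recurrent weights.  Gate order (my convention): i, f, o, g."""
    i, f, o, g = np.split(W @ x + U @ h + b, 4)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c_new = sig(f) * c + sig(i) * np.tanh(g)   # cell-state update
    h_new = sig(o) * np.tanh(c_new)            # hidden state / output
    return h_new, c_new

# Unfold the recurrence over a toy time-varying input x_t
rng = np.random.default_rng(0)
Xd, H = 3, 5
W = rng.standard_normal((4 * H, Xd))
U = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(10):
    h, c = lstm_step(rng.standard_normal(Xd), h, c, W, U, b)
```

In the NTM/DNC, extra rows of the same weight matrices produce the interface parameters alongside the ordinary output.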
The first architectural difference of the DNC from the NTM is that the DNC employs multiple read vectors and a deep LSTM, although neither is strictly required. The qualitative difference is that the DNC uses dynamic memory allocation to ensure that blocks of allocated memory do not overlap and interfere. Unlike the NTM, this new addressing mechanism allows memory to be reused when processing long sequences.
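For concreteness, here is a small NumPy sketch of the allocation weighting described in [1] (the function and variable names are mine): slots are sorted into a free list by usage, and the freest slots receive the highest allocation weight.

```python
import numpy as np

def allocation_weighting(usage):
    """DNC dynamic memory allocation [1]: slot j in the free list gets
    weight (1 - usage[j]) times the product of the usages before it."""
    phi = np.argsort(usage)          # free list: slot indices, least used first
    a = np.zeros_like(usage)
    shrink = 1.0
    for j in phi:
        a[j] = (1.0 - usage[j]) * shrink
        shrink *= usage[j]           # later (more used) slots get less weight
    return a

a = allocation_weighting(np.array([0.9, 0.1, 0.5]))
print(a)   # the least-used slot (index 1) dominates
```

Because writes increase a slot's usage and frees decrease it, this weighting steers new writes away from occupied blocks, which is exactly what prevents the overlap and interference mentioned above.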
Note that there are some notational mismatches between NTM and DNC, but they are consistent with the original papers.
(still working…)
[1] A. Graves, G. Wayne et al.
Hybrid computing using a neural network with dynamic external memory.
Nature, 2016
[2] C. Olah and S. Carter
Attention and Augmented Recurrent Neural Networks
Distill, 2016
[3] A. Graves, G. Wayne, and I. Danihelka
Neural Turing Machines
ArXiv, 2014
One of the major causes the authors point to is that we are too much “online”: the intense and sustained concentration that deep thinking requires is inhibited by perpetually distracting messaging and the massive communication of the Internet.
Perhaps “thinking out of the box” has become rare because the Internet is itself a box.
Taking Grigori Perelman’s proof of the Poincaré conjecture and Yitang Zhang’s contributions to the twin-prime conjecture as extreme examples of accomplishments attributable to an instinct for solitude, I fully agree on the necessity of being disconnected/unplugged for one’s own benefit, let alone for science. We may have to rely on yet another new technology that enables us to spare such time without conscious effort.
[1] Donald Geman and Stuart Geman
Opinion: Science in the age of selfies
PNAS 2016 113 (34) 9384-9387; doi:10.1073/pnas.1609793113
It is well known that the inverse covariance matrix \((\Gamma=\Sigma^{-1})\) of any multivariate Gaussian is graph-structured as a consequence of the Hammersley–Clifford theorem; zeros in \(\Gamma\) indicate the absence of an edge in the corresponding graphical model. For non-Gaussian distributions, however, it is unknown whether the entries of \(\Gamma\) have any relationship with the strengths of correlations along edges in the graph. Loh and Wainwright [1] studied the analog of this type of correspondence for non-Gaussian graphical models, and we will touch on the main theorem using a binary Ising model, a special case of the non-Gaussian setting.
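As a quick sanity check of the Gaussian case, here is a small NumPy sketch (a toy example of my own, not from the paper): a chain-structured Gaussian on four nodes is defined by a tridiagonal precision matrix, its covariance is dense, and inverting the covariance recovers the zeros.

```python
import numpy as np

# Chain graph 1-2-3-4: a Gaussian defined by a tridiagonal precision matrix
Gamma = np.array([[ 2., -1.,  0.,  0.],
                  [-1.,  2., -1.,  0.],
                  [ 0., -1.,  2., -1.],
                  [ 0.,  0., -1.,  2.]])
Sigma = np.linalg.inv(Gamma)        # covariance is dense: every pair correlates
Gamma_back = np.linalg.inv(Sigma)   # ...but its inverse recovers the zeros

# Zeros at (1,3), (1,4), (2,4): exactly the edges missing from the chain
print(np.round(Gamma_back, 10))
```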
The figure above shows a simple graph on four nodes (row 1) and the corresponding \(\Gamma\) (row 2). Notably, the edge structure of the chain graph (a) is uncovered in
$$\Gamma_a = \Sigma_a^{-1} = \textrm{cov}(X_1,X_2,X_3,X_4)^{-1}$$
where the missing edges (1,3), (1,4), and (2,4) are represented in blue, implying (nearly) zero entries in \(\Gamma\) and hence no edges in the graph. Note that row 2 actually displays \(\log(\vert \Gamma \vert)\) to highlight the difference between zero and nonzero values in \(\Gamma\); large negative values (bluish and greenish color codes) in the log domain thus suggest the absence of an edge, up to the finite precision of the computations. The inverse covariance matrix \(\Gamma_b\) of the cycle (b), however, does not reveal its edge structure in the same way as the chain.
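The same comparison can be reproduced exactly for a binary Ising model by enumerating all \(2^4\) states (a toy sketch with parameters of my own choosing, not the paper's code): the chain's inverse covariance is graph-structured, while the cycle's in general is not.

```python
import numpy as np
from itertools import product

def ising_cov(edges, theta_edge, theta_node, stats):
    """Exact covariance of the given statistics under a 4-spin Ising model."""
    states = np.array(list(product([-1, 1], repeat=4)), dtype=float)
    logp = states @ theta_node + sum(
        t * states[:, i] * states[:, j] for (i, j), t in zip(edges, theta_edge))
    p = np.exp(logp - logp.max())
    p /= p.sum()                                  # exact Gibbs probabilities
    Psi = np.column_stack([f(states) for f in stats])
    mu = p @ Psi
    return ((Psi - mu).T * p) @ (Psi - mu)

theta_node = np.array([0.1, 0.2, -0.1, 0.3])      # generic external fields
singles = [lambda s, k=k: s[:, k] for k in range(4)]

# Chain 1-2-3-4: the inverse covariance is exactly tridiagonal
chain = [(0, 1), (1, 2), (2, 3)]
G_chain = np.linalg.inv(ising_cov(chain, [0.5, 0.6, 0.7], theta_node, singles))

# Cycle 1-2-3-4-1: the missing edges (1,3) and (2,4) are not recovered
cycle = [(0, 1), (1, 2), (2, 3), (0, 3)]
G_cycle = np.linalg.inv(ising_cov(cycle, [0.5, 0.6, 0.7, 0.4], theta_node, singles))
print(np.round(G_chain, 6))
print(np.round(G_cycle, 6))
```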
When do certain inverse covariances capture the structure of a graphical model? Here comes the main theorem:
THEOREM 1 (Triangulation and block graph-structure) Consider an arbitrary discrete graphical model of a general multinomial exponential family, and let \(\tilde{\mathcal{C}}\) be the set of all cliques in any triangulation of the graph \(G\). Then the generalized covariance matrix cov\((\Psi(X;\tilde{\mathcal{C}}))\) is invertible, and its inverse \(\Gamma\) is block graph-structured:
(a) For any subset \(A,B \in \tilde{\mathcal{C}}\) that are not subsets of the same maximal clique, the block \(\Gamma(A,B)\) is identically zero.
(b) For almost all parameters \(\theta\), the entire block \(\Gamma(A,B)\) is nonzero whenever \(A\) and \(B\) belong to a common maximal clique.
The essence of the theorem lies in adding higher-order moments to the original set of random variables before computing \(\Sigma\), following the rule stated above. Graphs (c) and (d) are examples of triangulations of the cycle in (b), and the set of all cliques \(\tilde{\mathcal{C}}\) of the triangulated graph (c,e) is as follows:
$$\Psi(X) = \{ X_1,X_2,X_3,X_4,X_1X_2,X_2X_3,X_3X_4,X_1X_4,X_1X_3,X_1X_2X_3,X_1X_3X_4\}$$
Does the 11 \(\times\) 11 inverse of the matrix \(\textrm{cov}(\Psi(X))\) capture aspects of the graph structure? Yes: greenish colors correspond to zeros in the inverse covariance, some of which are \((X_2,X_4),(X_2,X_3X_4), (X_2,X_1X_4)\). Furthermore, one of the corollaries says that a much smaller set of cliques may be sufficient to satisfy Theorem 1. Graphs (c) and (d) are concrete examples of the corollary:
$$\Psi(X) = \{ X_1,X_2,X_3,X_4,X_1X_3\}$$
$$\Psi(X) = \{ X_1,X_2,X_3,X_4,X_2X_4\}$$
The two augmented vectors unveil the absence of edges (2,4) and (1,3), respectively: after adding the chord statistic, the two nodes off the chord no longer share a maximal clique.
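To see the corollary in action, the following sketch (with toy parameters of my own choosing) augments the singletons of the cycle with the chord statistic \(X_1X_3\); the \((X_2,X_4)\) entry of the resulting 5 \(\times\) 5 inverse covariance vanishes.

```python
import numpy as np
from itertools import product

# 4-cycle Ising model with generic (arbitrarily chosen) parameters
states = np.array(list(product([-1, 1], repeat=4)), dtype=float)
theta_node = np.array([0.1, 0.2, -0.1, 0.3])
edges = [(0, 1, 0.5), (1, 2, 0.6), (2, 3, 0.7), (0, 3, 0.4)]
logp = states @ theta_node + sum(t * states[:, i] * states[:, j] for i, j, t in edges)
p = np.exp(logp - logp.max())
p /= p.sum()                                     # exact Gibbs probabilities

# Augment the four singletons with the chord statistic X1*X3
Psi = np.column_stack([states, states[:, 0] * states[:, 2]])
mu = p @ Psi
Gamma = np.linalg.inv(((Psi - mu).T * p) @ (Psi - mu))

# After this triangulation, nodes 2 and 4 share no maximal clique,
# so the (X2, X4) entry of the inverse covariance is (numerically) zero
print(Gamma[1, 3])
```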
This is quite cool because the authors have taken a leap toward a general connection between graph structure and inverse covariance. Many questions remain open, such as how to deal with missing data or with discrete graphical models containing hidden variables.
[1] P. Loh and M.J. Wainwright.
Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses.
Annals of Statistics, 41(6):3022–3049, 2013
We want to unravel the neural specificity for each task parameter from a large population of neurons. The problem is reminiscent of standard dimensionality reduction methods such as principal component analysis (PCA) or linear discriminant analysis (LDA), which share the same objective of extracting a few key low-dimensional components that capture the essence of high-dimensional data. Unfortunately, vanilla PCA has no mechanism for actively separating the task parameters, so mixed selectivity remains in all principal components, whereas LDA typically aims to maximize the separation between groups of data with respect to only one parameter. Kobak et al. [1] introduced an interesting dimensionality reduction technique, demixed principal component analysis (dPCA), to overcome the limitations of existing methods and achieve the aforementioned goal.
In a nutshell, the first major step of dPCA is to decompose the data \(\mathbf{X}\) into uncorrelated conditional terms \(\mathbf{X}_{\phi}\) through marginalization procedures: \(\mathbf{X} = \sum_{\phi}\mathbf{X}_{\phi} + \mathbf{X}_{noise}\), where \(\mathbf{X}_{\phi}\) is the portion dependent on the parameter \(\phi\) only (with appropriate mean subtractions followed by averaging over \(\backslash \phi\)). The next step is to perform a PCA-like dimensionality reduction that demixes the inherently entangled selectivity of neurons, with the objective of minimizing the Frobenius norm of the difference between \(\mathbf{X}_{\phi}\) and \(\hat{\mathbf{X}}_{\phi}\).
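The marginalization step can be sketched in a few lines of NumPy on toy data (array names and the neurons \(\times\) stimuli \(\times\) time layout are my own simplification; with trial-averaged data the remainder is the interaction term rather than trial-to-trial noise):

```python
import numpy as np

rng = np.random.default_rng(0)
N, S, T = 8, 3, 20                        # neurons, stimuli, time points
X = rng.standard_normal((N, S, T))        # toy trial-averaged firing rates

# Marginalization: split the centred data into parameter-specific parts
Xbar   = X.mean(axis=(1, 2), keepdims=True)      # grand mean per neuron
X_time = X.mean(axis=1, keepdims=True) - Xbar    # time-dependent part
X_stim = X.mean(axis=2, keepdims=True) - Xbar    # stimulus-dependent part
X_rest = X - Xbar - X_time - X_stim              # interaction/residual part

# The parts are uncorrelated and reassemble the centred data exactly
print(np.allclose(X - Xbar, X_time + X_stim + X_rest))
```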
The core of dPCA lies in designing the linear decoder \(\mathbf{D}_{\phi}\) and encoder \(\mathbf{F}_{\phi}\) that match \(\hat{\mathbf{X}}_{\phi} (=\mathbf{F}_{\phi}\mathbf{D}_{\phi}\mathbf{X})\) to \(\mathbf{X}_{\phi}\) as closely as possible. Here, \(\mathbf{D}_{\phi}\) and \(\mathbf{F}_{\phi}\) are estimated analytically by solving a linear regression problem with a low-rank constraint on the regression coefficient matrix, known as the reduced-rank regression problem. I said “PCA-like” because \(\mathbf{F}_{\phi}\) and \(\mathbf{D}_{\phi}\) are forced to be equal in PCA but not in dPCA. Nevertheless, a PCA-style step is used to meet the rank constraint and, as a consequence, to determine the final \(\mathbf{D}_{\phi}\) and \(\mathbf{F}_{\phi}\).
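Here is a sketch of the reduced-rank regression step for a single marginalization, again on toy data; the SVD-based solution below is the textbook one and may differ in details (regularization, balancing) from the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, S, T, q = 8, 3, 20, 2                 # neurons, stimuli, time points, rank
R = rng.standard_normal((N, S, T))       # toy firing rates
Rbar = R.mean(axis=(1, 2), keepdims=True)

X = (R - Rbar).reshape(N, -1)            # centred data, neurons x conditions
X_phi = ((R.mean(axis=2, keepdims=True) - Rbar) * np.ones_like(R)).reshape(N, -1)

# Reduced-rank regression: minimise ||X_phi - F D X||_F^2 with rank(FD) = q
B = X_phi @ X.T @ np.linalg.pinv(X @ X.T)        # unconstrained OLS solution
U, _, _ = np.linalg.svd(B @ X, full_matrices=False)
F = U[:, :q]                                     # encoder, N x q
D = F.T @ B                                      # decoder, q x N
X_hat = F @ D @ X                                # rank-q estimate of X_phi
```

The PCA flavour is visible in the last step: the rank constraint is enforced by projecting the full-rank fit onto its top-\(q\) left singular vectors, while \(\mathbf{F}\) and \(\mathbf{D}\) remain distinct matrices.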
The authors provide extensive approaches to practical issues such as overfitting and unbalanced/missing data, and they also discuss the current limitations of dPCA, which I strongly encourage you to read.
I posted a Jupyter notebook on dPCA that revisits a toy example from the paper and visualizes what dPCA actually does. I hope it helps you get to know dPCA better.
[1] D Kobak+, W Brendel+, C Constantinidis, CE Feierstein, A Kepecs, ZF Mainen, X-L Qi, R Romo, N Uchida, CK Machens
Demixed principal component analysis of neural population data
eLife 2016