Introduction

The gcLDA (generalized correspondence-LDA) model is a generalization of the correspondence-LDA model (Blei & Jordan, 2003, “Modeling annotated data”), an unsupervised learning model for jointly modeling multiple data types, where one data type describes the other. The gcLDA model was introduced in the following paper:

where the model was applied to the Neurosynth corpus of fMRI publications. Each publication in this corpus consists of a set of word tokens and a set of reported peak activation coordinates (x, y, and z spatial coordinates corresponding to brain locations).
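To make the data layout concrete, here is a minimal sketch of how one publication could be represented in Python. The `Publication` class, its field names, and the example values are hypothetical and chosen for illustration only; they are not the package's actual input format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coordinate = Tuple[float, float, float]  # (x, y, z) location of one reported peak


@dataclass
class Publication:
    """One document: its word tokens plus its reported peak activation coordinates."""

    words: List[str]
    peaks: List[Coordinate]

    @property
    def n_word_tokens(self) -> int:
        # Number of word tokens in this document
        return len(self.words)

    @property
    def n_peak_tokens(self) -> int:
        # Number of peak activation tokens in this document
        return len(self.peaks)


# Made-up example values, not taken from the Neurosynth corpus.
doc = Publication(
    words=["memory", "retrieval", "memory", "encoding"],
    peaks=[(-24.0, -18.0, -16.0), (26.0, -16.0, -18.0)],
)
print(doc.n_word_tokens, doc.n_peak_tokens)
```

Note that word tokens are kept as a list rather than a set of unique words, so a word that occurs twice in a publication contributes two tokens.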

When applied to fMRI publication data, the gcLDA model identifies a set of T topics, where each topic captures a ‘functional region’ of the brain. More formally: each topic is associated with (1) a spatial probability distribution that captures the extent of a functional neural region, and (2) a probability distribution over linguistic features that captures the cognitive function of the region.
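The two components of a topic can be sketched as a small data structure. The example below is illustrative only: the `Topic` class and the helper names are hypothetical, the numbers are made up, and the spatial distribution is simplified to a single isotropic 3-D Gaussian (an assumption made here for brevity, not a statement about how the package parameterizes it).

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Topic:
    mu: Tuple[float, float, float]  # center of the spatial distribution
    sigma: float                    # isotropic standard deviation (simplifying assumption)
    word_probs: Dict[str, float]    # probability of each word type given the topic


def spatial_density(topic: Topic, xyz: Tuple[float, float, float]) -> float:
    """Density of an isotropic 3-D Gaussian centered at topic.mu, evaluated at xyz."""
    sq_dist = sum((a - m) ** 2 for a, m in zip(xyz, topic.mu))
    norm = (2.0 * math.pi * topic.sigma ** 2) ** 1.5
    return math.exp(-sq_dist / (2.0 * topic.sigma ** 2)) / norm


def top_words(topic: Topic, n: int) -> List[str]:
    """The n word types with the highest probability under the topic."""
    return sorted(topic.word_probs, key=topic.word_probs.get, reverse=True)[:n]


# Made-up example topic.
topic = Topic(
    mu=(-24.0, -18.0, -16.0),
    sigma=10.0,
    word_probs={"memory": 0.5, "retrieval": 0.3, "encoding": 0.2},
)
near = spatial_density(topic, (-24.0, -18.0, -16.0))
far = spatial_density(topic, (40.0, 20.0, 30.0))
print(near > far, top_words(topic, 2))
```

The spatial part answers "where in the brain is this region?", while the word distribution answers "which terms tend to accompany activations there?".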

The gcLDA model can also be applied directly to other types of data. For example, Blei & Jordan presented correspondence-LDA for modeling annotated images, where pre-segmented images were represented by vectors of real-valued image features. The code provided here should be directly applicable to these types of data, provided that they are appropriately formatted. Note, however, that this package has only been tested on the Neurosynth dataset; some modifications may be needed for use with other datasets.

Notation

Notation Meaning
w_{i}, x_{i} The i-th word token and peak activation token in the corpus, respectively
N_{w}^{(d)}, N_{x}^{(d)} The number of word tokens and peak activation tokens in document d, respectively
D The number of documents in the corpus
T The number of topics in the corpus
R The number of components/subregions in each topic’s spatial distribution (subregion models)
z_{i} Indicator variable assigning word token w_{i} to a topic
y_{i} Indicator variable assigning activation token x_{i} to a topic
z^{(d)}, y^{(d)} The set of all indicator variables for word tokens and activation tokens in document d
N_{td}^{YD} The number of activation tokens within document d that are assigned to topic t
c_{i} Indicator variable assigning activation token x_{i} to a subregion (subregion models)
\Lambda^{(t)} Placeholder for all spatial parameters for topic t
\mu^{(t)}, \sigma^{(t)} Gaussian parameters for topic t
\mu_{r}^{(t)}, \sigma_{r}^{(t)} Gaussian parameters for subregion r in topic t (subregion models)
\phi^{(t)} Multinomial distribution over word types for topic t
\phi_{w}^{(t)} Probability of word type w given topic t
\theta^{(d)} Multinomial distribution over topics for document d
\theta_{t}^{(d)} Probability of topic t given document d
\pi^{(t)} Multinomial distribution over subregions for topic t (subregion models)
\pi_{r}^{(t)} Probability of subregion r given topic t (subregion models)
\beta, \alpha, \gamma Model hyperparameters
\delta Model hyperparameter (subregion models)
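
To show how the symbols above fit together, the following sketch generates one synthetic document. This section does not spell out the generative process, so the steps below are an assumption based on the correspondence-LDA scheme: activation tokens are assigned to topics according to \theta^{(d)}, and word tokens are then assigned to topics in proportion to N_{td}^{YD} + \gamma. The function names are hypothetical, subregions are omitted, and each topic's spatial distribution is simplified to an isotropic Gaussian.

```python
import random
from typing import List, Sequence, Tuple


def sample_dirichlet(concentration: float, size: int, rng: random.Random) -> List[float]:
    """Draw from a symmetric Dirichlet by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(
    n_x: int,                                   # N_x^(d): number of activation tokens
    n_w: int,                                   # N_w^(d): number of word tokens
    mu: Sequence[Tuple[float, float, float]],   # mu^(t) for each topic
    sigma: Sequence[float],                     # sigma^(t), isotropic (assumption)
    phi: Sequence[Sequence[float]],             # phi^(t): word-type probabilities per topic
    alpha: float,
    gamma: float,
    rng: random.Random,
):
    n_topics = len(mu)
    topics = range(n_topics)
    theta = sample_dirichlet(alpha, n_topics, rng)  # theta^(d)

    y, x = [], []
    for _ in range(n_x):
        t = rng.choices(topics, weights=theta)[0]   # y_i: topic of activation token
        y.append(t)
        x.append(tuple(rng.gauss(m, sigma[t]) for m in mu[t]))  # x_i: sampled peak

    # N_td^YD: activation tokens in this document assigned to each topic
    n_yd = [y.count(t) for t in topics]
    z_weights = [count + gamma for count in n_yd]   # assumed smoothing by gamma

    z, w = [], []
    for _ in range(n_w):
        t = rng.choices(topics, weights=z_weights)[0]  # z_i: topic of word token
        z.append(t)
        w.append(rng.choices(range(len(phi[t])), weights=phi[t])[0])  # w_i: word-type index
    return theta, y, x, z, w


# Made-up parameters for T = 2 topics and a vocabulary of 3 word types.
rng = random.Random(0)
mu = [(-24.0, -18.0, -16.0), (40.0, 20.0, 30.0)]
sigma = [8.0, 10.0]
phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
theta, y, x, z, w = generate_document(5, 8, mu, sigma, phi, alpha=1.0, gamma=0.01, rng=rng)
print(len(x), len(w))
```

Under this sketch, a word token can only be assigned with appreciable probability to a topic that already accounts for some of the document's activations, which is what ties the linguistic and spatial data types together.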