# Getting started¶

This tutorial is based on the quickstart example in the celerite documentation, but it has been updated to work with celerite2.

For this tutorial, we’re going to fit a Gaussian Process (GP) model to a simulated dataset with quasiperiodic oscillations. We’re also going to leave a gap in the simulated data and we’ll use the GP model to predict what we would have observed for those “missing” datapoints.

To start, here’s some code to simulate the dataset:

:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

t = np.sort(
np.append(
np.random.uniform(0, 3.8, 57),
np.random.uniform(5.5, 10, 68),
)
)  # The input coordinates must be sorted
yerr = np.random.uniform(0.08, 0.22, len(t))
y = (
0.2 * (t - 5)
+ np.sin(3 * t + 0.1 * (t - 5) ** 2)
+ yerr * np.random.randn(len(t))
)

true_t = np.linspace(0, 10, 500)
true_y = 0.2 * (true_t - 5) + np.sin(3 * true_t + 0.1 * (true_t - 5) ** 2)

plt.plot(true_t, true_y, "k", lw=1.5, alpha=0.3)
plt.errorbar(t, y, yerr=yerr, fmt=".k", capsize=0)
plt.xlabel("x [day]")
plt.ylabel("y [ppm]")
plt.xlim(0, 10)
plt.ylim(-2.5, 2.5)
_ = plt.title("simulated data") Now, let’s fit this dataset using a mixture of SHOTerm terms: one quasi-periodic component and one non-periodic component. First let’s set up an initial model to see how it looks:

:

import celerite2
from celerite2 import terms

# Quasi-periodic term
term1 = terms.SHOTerm(sigma=1.0, rho=1.0, tau=10.0)

# Non-periodic component
term2 = terms.SHOTerm(sigma=1.0, rho=5.0, Q=0.25)
kernel = term1 + term2

# Setup the GP
gp = celerite2.GaussianProcess(kernel, mean=0.0)
gp.compute(t, yerr=yerr)

print("Initial log likelihood: {0}".format(gp.log_likelihood(y)))

Initial log likelihood: -16.75164079832625


Let’s look at the underlying power spectral density of this initial model:

:

freq = np.linspace(1.0 / 8, 1.0 / 0.3, 500)
omega = 2 * np.pi * freq

def plot_psd(gp):
for n, term in enumerate(gp.kernel.terms):
plt.loglog(freq, term.get_psd(omega), label="term {0}".format(n + 1))
plt.loglog(freq, gp.kernel.get_psd(omega), ":k", label="full model")
plt.xlim(freq.min(), freq.max())
plt.legend()
plt.xlabel("frequency [1 / day]")
plt.ylabel("power [day ppt$^2$]")

plt.title("initial psd")
plot_psd(gp) And then we can also plot the prediction that this model makes for the missing data and compare it to the truth:

:

def plot_prediction(gp):
plt.plot(true_t, true_y, "k", lw=1.5, alpha=0.3, label="data")
plt.errorbar(t, y, yerr=yerr, fmt=".k", capsize=0, label="truth")

if gp:
mu, variance = gp.predict(y, t=true_t, return_var=True)
sigma = np.sqrt(variance)
plt.plot(true_t, mu, label="prediction")
plt.fill_between(true_t, mu - sigma, mu + sigma, color="C0", alpha=0.2)

plt.xlabel("x [day]")
plt.ylabel("y [ppm]")
plt.xlim(0, 10)
plt.ylim(-2.5, 2.5)
plt.legend()

plt.title("initial prediction")
plot_prediction(gp) Ok, that looks pretty terrible, but we can get a better fit by numerically maximizing the likelihood as described in the following section.

## Maximum likelihood¶

In this section, we’ll improve our initial GP model by maximizing the likelihood function for the parameters of the kernel, the mean, and a “jitter” (a constant variance term added to the diagonal of our covariance matrix). To do this, we’ll use the numerical optimization routine from scipy:

:

from scipy.optimize import minimize

def set_params(params, gp):
gp.mean = params
theta = np.exp(params[1:])
gp.kernel = terms.SHOTerm(
sigma=theta, rho=theta, tau=theta
) + terms.SHOTerm(sigma=theta, rho=theta, Q=0.25)
gp.compute(t, diag=yerr ** 2 + theta, quiet=True)
return gp

def neg_log_like(params, gp):
gp = set_params(params, gp)
return -gp.log_likelihood(y)

initial_params = [0.0, 0.0, 0.0, np.log(10.0), 0.0, np.log(5.0), np.log(0.01)]
soln = minimize(neg_log_like, initial_params, method="L-BFGS-B", args=(gp,))
opt_gp = set_params(soln.x, gp)
soln

:

      fun: -15.942826267575825
hess_inv: <7x7 LbfgsInvHessProduct with dtype=float64>
jac: array([-3.41060513e-05, -7.44648787e-04,  9.41895446e-03,  1.40687463e-04,
-3.16902059e-04,  2.33058019e-04,  2.98427951e-05])
message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
nfev: 392
nit: 45
njev: 49
status: 0
success: True
x: array([ 4.64320461e-03, -3.41572228e-01,  7.00030496e-01,  1.94353586e+00,
6.05877895e-01,  3.75557887e+00, -7.87243112e+00])


Now let’s make the same plots for the maximum likelihood model:

:

plt.figure()
plt.title("maximum likelihood psd")
plot_psd(opt_gp)

plt.figure()
plt.title("maximum likelihood prediction")
plot_prediction(opt_gp)  These predictions are starting to look much better!

## Posterior inference using emcee¶

Now, to get a sense for the uncertainties on our model, let’s use Markov chain Monte Carlo (MCMC) to numerically estimate the posterior expectations of the model. In this first example, we’ll use the emcee package to run our MCMC. Our likelihood function is the same as the one we used in the previous section, but we’ll also choose a wide normal prior on each of our parameters.

:

import emcee

prior_sigma = 2.0

def log_prob(params, gp):
gp = set_params(params, gp)
return (
gp.log_likelihood(y) - 0.5 * np.sum((params / prior_sigma) ** 2),
gp.kernel.get_psd(omega),
)

np.random.seed(5693854)
coords = soln.x + 1e-5 * np.random.randn(32, len(soln.x))
sampler = emcee.EnsembleSampler(
coords.shape, coords.shape, log_prob, args=(gp,)
)
state = sampler.run_mcmc(coords, 2000, progress=True)
sampler.reset()
state = sampler.run_mcmc(state, 5000, progress=True)

100%|██████████| 2000/2000 [00:34<00:00, 57.50it/s]
100%|██████████| 5000/5000 [01:24<00:00, 59.45it/s]


After running our MCMC, we can plot the predictions that the model makes for a handful of samples from the chain. This gives a qualitative sense of the uncertainty in the predictions.

:

chain = sampler.get_chain(discard=100, flat=True)

for sample in chain[np.random.randint(len(chain), size=50)]:
gp = set_params(sample, gp)
conditional = gp.condition(y, true_t)
plt.plot(true_t, conditional.sample(), color="C0", alpha=0.1)

plt.title("posterior prediction")
plot_prediction(None) Similarly, we can plot the posterior expectation for the power spectral density:

:

psds = sampler.get_blobs(discard=100, flat=True)

q = np.percentile(psds, [16, 50, 84], axis=0)

plt.loglog(freq, q, color="C0")
plt.fill_between(freq, q, q, color="C0", alpha=0.1)

plt.xlim(freq.min(), freq.max())
plt.xlabel("frequency [1 / day]")
plt.ylabel("power [day ppt$^2$]")
_ = plt.title("posterior psd using emcee") ## Posterior inference using PyMC3¶

celerite2 also includes support for probabilistic modeling using PyMC3, and we can implement the same model from above as follows:

:

import pymc3 as pm

import celerite2.theano
from celerite2.theano import terms as theano_terms

with pm.Model() as model:

mean = pm.Normal("mean", mu=0.0, sigma=prior_sigma)
log_jitter = pm.Normal("log_jitter", mu=0.0, sigma=prior_sigma)

log_sigma1 = pm.Normal("log_sigma1", mu=0.0, sigma=prior_sigma)
log_rho1 = pm.Normal("log_rho1", mu=0.0, sigma=prior_sigma)
log_tau = pm.Normal("log_tau", mu=0.0, sigma=prior_sigma)
term1 = theano_terms.SHOTerm(
sigma=pm.math.exp(log_sigma1),
rho=pm.math.exp(log_rho1),
tau=pm.math.exp(log_tau),
)

log_sigma2 = pm.Normal("log_sigma2", mu=0.0, sigma=prior_sigma)
log_rho2 = pm.Normal("log_rho2", mu=0.0, sigma=prior_sigma)
term2 = theano_terms.SHOTerm(
sigma=pm.math.exp(log_sigma2), rho=pm.math.exp(log_rho2), Q=0.25
)

kernel = term1 + term2
gp = celerite2.theano.GaussianProcess(kernel, mean=mean)
gp.compute(t, diag=yerr ** 2 + pm.math.exp(log_jitter), quiet=True)
gp.marginal("obs", observed=y)

pm.Deterministic("psd", kernel.get_psd(omega))

trace = pm.sample(
tune=1000,
draws=1000,
target_accept=0.9,
cores=2,
chains=2,
random_seed=34923,
)

Auto-assigning NUTS sampler...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [log_rho2, log_sigma2, log_tau, log_rho1, log_sigma1, log_jitter, mean]

100.00% [4000/4000 00:25<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 38 seconds.


Like before, we can plot the posterior estimate of the power spectrum to show that the results are qualitatively similar:

:

psds = trace["psd"]

q = np.percentile(psds, [16, 50, 84], axis=0)

plt.loglog(freq, q, color="C0")
plt.fill_between(freq, q, q, color="C0", alpha=0.1)

plt.xlim(freq.min(), freq.max())
plt.xlabel("frequency [1 / day]")
plt.ylabel("power [day ppt$^2$]")
_ = plt.title("posterior psd using PyMC3") ## Posterior inference using numpyro¶

Since celerite2 includes support for JAX as well as Theano, you can also use tools like numpyro for inference. The following is similar to previous PyMC3 example, but the main difference is that (for technical reasons related to how JAX works) SHOTerms cannot be used in combination with jax.jit, so we need to explicitly specify the terms as “underdamped” (UnderdampedSHOTerm) or “overdamped” (OverdampedSHOTerm).

:

from jax.config import config

config.update("jax_enable_x64", True)

from jax import random
import jax.numpy as jnp

import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

import celerite2.jax
from celerite2.jax import terms as jax_terms

def numpyro_model(t, yerr, y=None):
mean = numpyro.sample("mean", dist.Normal(0.0, prior_sigma))
log_jitter = numpyro.sample("log_jitter", dist.Normal(0.0, prior_sigma))

log_sigma1 = numpyro.sample("log_sigma1", dist.Normal(0.0, prior_sigma))
log_rho1 = numpyro.sample("log_rho1", dist.Normal(0.0, prior_sigma))
log_tau = numpyro.sample("log_tau", dist.Normal(0.0, prior_sigma))
term1 = jax_terms.UnderdampedSHOTerm(
sigma=jnp.exp(log_sigma1), rho=jnp.exp(log_rho1), tau=jnp.exp(log_tau)
)

log_sigma2 = numpyro.sample("log_sigma2", dist.Normal(0.0, prior_sigma))
log_rho2 = numpyro.sample("log_rho2", dist.Normal(0.0, prior_sigma))
term2 = jax_terms.OverdampedSHOTerm(
sigma=jnp.exp(log_sigma2), rho=jnp.exp(log_rho2), Q=0.25
)

kernel = term1 + term2
gp = celerite2.jax.GaussianProcess(kernel, mean=mean)
gp.compute(t, diag=yerr ** 2 + jnp.exp(log_jitter), check_sorted=False)

numpyro.sample("obs", gp.numpyro_dist(), obs=y)
numpyro.deterministic("psd", kernel.get_psd(omega))

nuts_kernel = NUTS(numpyro_model, dense_mass=True)
mcmc = MCMC(nuts_kernel, num_warmup=1000, num_samples=1000, num_chains=2)
rng_key = random.PRNGKey(34923)
%time mcmc.run(rng_key, t, yerr, y=y)

CPU times: user 13.4 s, sys: 177 ms, total: 13.6 s
Wall time: 13.3 s


This runtime was similar to the PyMC3 result from above, and (as we’ll see below) the convergence is also similar. Any difference in runtime will probably disappear for more computationally expensive models, but this interface is looking pretty great here!

As above, we can plot the posterior expectations for the power spectrum:

:

psds = np.asarray(mcmc.get_samples()["psd"])

q = np.percentile(psds, [16, 50, 84], axis=0)

plt.loglog(freq, q, color="C0")
plt.fill_between(freq, q, q, color="C0", alpha=0.1)

plt.xlim(freq.min(), freq.max())
plt.xlabel("frequency [1 / day]")
plt.ylabel("power [day ppt$^2$]")
_ = plt.title("posterior psd using numpyro") ## Comparison¶

Finally, let’s compare the results of these different inference methods a bit more quantitaively. First, let’s look at the posterior constraint on the period of the underdamped harmonic oscillator, the effective period of the oscillatory signal.

:

import arviz as az

emcee_data = az.from_emcee(
sampler,
var_names=[
"mean",
"log_sigma1",
"log_rho1",
"log_tau",
"log_sigma2",
"log_rho2",
"log_jitter",
],
)

with model:
pm_data = az.from_pymc3(trace)

numpyro_data = az.from_numpyro(mcmc)

bins = np.linspace(1.5, 2.75, 25)
plt.hist(
np.exp(np.asarray((emcee_data.posterior["log_rho1"].T)).flatten()),
bins,
histtype="step",
density=True,
label="emcee",
)
plt.hist(
np.exp(np.asarray((pm_data.posterior["log_rho1"].T)).flatten()),
bins,
histtype="step",
density=True,
label="PyMC3",
)
plt.hist(
np.exp(np.asarray((numpyro_data.posterior["log_rho1"].T)).flatten()),
bins,
histtype="step",
density=True,
label="numpyro",
)
plt.legend()
plt.yticks([])
plt.xlabel(r"$\rho_1$")
_ = plt.ylabel(r"$p(\rho_1)$") That looks pretty consistent.

Next we can look at the ArviZ summary for each method to see how the posterior expectations and convergence diagnostics look.

:

az.summary(
emcee_data,
var_names=[
"mean",
"log_sigma1",
"log_rho1",
"log_tau",
"log_sigma2",
"log_rho2",
"log_jitter",
],
)

:

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
mean -0.014 1.259 -2.486 2.489 0.039 0.027 1045.0 2618.0 1.02
log_sigma1 -0.284 0.345 -0.876 0.390 0.009 0.006 1647.0 3666.0 1.01
log_rho1 0.703 0.059 0.596 0.811 0.001 0.001 1688.0 3620.0 1.02
log_tau 1.942 0.767 0.600 3.423 0.019 0.014 1593.0 3627.0 1.02
log_sigma2 0.502 0.632 -0.593 1.693 0.016 0.011 1647.0 3654.0 1.02
log_rho2 3.279 0.865 1.768 4.923 0.023 0.016 1478.0 4378.0 1.02
log_jitter -5.858 0.735 -7.265 -4.601 0.018 0.013 1751.0 3903.0 1.02
:

az.summary(
pm_data,
var_names=[
"mean",
"log_sigma1",
"log_rho1",
"log_tau",
"log_sigma2",
"log_rho2",
"log_jitter",
],
)

:

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
mean 0.093 1.195 -2.256 2.527 0.037 0.031 1152.0 1021.0 1.0
log_sigma1 -0.277 0.359 -0.883 0.410 0.012 0.008 1143.0 748.0 1.0
log_rho1 0.701 0.057 0.586 0.807 0.001 0.001 1613.0 1105.0 1.0
log_tau 1.956 0.782 0.595 3.418 0.023 0.019 1337.0 969.0 1.0
log_sigma2 0.494 0.657 -0.600 1.751 0.020 0.015 1192.0 1155.0 1.0
log_rho2 3.258 0.891 1.771 5.063 0.027 0.019 1110.0 1429.0 1.0
log_jitter -5.829 0.713 -7.076 -4.535 0.017 0.012 2026.0 1251.0 1.0
:

az.summary(
numpyro_data,
var_names=[
"mean",
"log_sigma1",
"log_rho1",
"log_tau",
"log_sigma2",
"log_rho2",
"log_jitter",
],
)

:

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
mean -0.042 1.245 -2.637 2.254 0.040 0.030 1077.0 871.0 1.0
log_sigma1 -0.275 0.358 -0.859 0.422 0.014 0.010 817.0 591.0 1.0
log_rho1 0.699 0.055 0.606 0.807 0.001 0.001 1919.0 1377.0 1.0
log_tau 1.972 0.810 0.596 3.559 0.028 0.023 1023.0 760.0 1.0
log_sigma2 0.494 0.626 -0.623 1.656 0.019 0.016 1159.0 1099.0 1.0
log_rho2 3.257 0.837 1.773 4.807 0.025 0.018 1123.0 1240.0 1.0
log_jitter -5.796 0.698 -7.164 -4.642 0.018 0.014 1998.0 805.0 1.0

Overall these results are consistent, but the $$\hat{R}$$ values are a bit high for the emcee run, so I’d probably run that for longer. Either way, for models like these, PyMC3 and numpyro are generally going to be much better inference tools (in terms of runtime per effective sample) than emcee, so those are the recommended interfaces if the rest of your model can be easily implemented in such a framework.