Calibrate and Summarise Multiple Radiocarbon Samples via a Bayesian Non-Parametric DPMM (with Polya Urn Updating)
Source:R/PolyaUrnBivarDirichlet.R
PolyaUrnBivarDirichlet.RdThis function calibrates sets of multiple radiocarbon (\({}^{14}\)C) determinations, and simultaneously summarises the resultant calendar age information. This is achieved using Bayesian non-parametric density estimation, providing a statistically-rigorous alternative to summed probability distributions (SPDs).
It takes as an input a set of \({}^{14}\)C determinations and associated \(1\sigma\) uncertainties, as well as the radiocarbon age calibration curve to be used. The samples are assumed to arise from an (unknown) shared calendar age distribution \(f(\theta)\) that we would like to estimate, alongside performing calibration of each sample.
The function models the underlying distribution \(f(\theta)\) as a Dirichlet process mixture model (DPMM), whereby the samples are considered to arise from an unknown number of distinct clusters. Fitting is achieved via MCMC.
It returns estimates for the calendar age of each individual radiocarbon sample; and broader output (including the means and variances of the underpinning calendar age clusters) that can be used by other library functions to provide a predictive estimate of the shared calendar age density \(f(\theta)\).
For more information read the vignette: vignette("Non-parametric-summed-density", package = "carbondate")
Note: The library provides two slightly-different update schemes for the MCMC. In this particular function, updating of the DPMM is achieved by a Polya Urn approach (Neal 2000) This is our recommended updating approach based on testing. The alternative, slice-sampled, approach can be found at WalkerBivarDirichlet
References:
Heaton, TJ. 2022. “Non-parametric Calibration of Multiple Related Radiocarbon
Determinations and their Calendar Age Summarisation.” Journal of the Royal Statistical
Society Series C: Applied Statistics 71 (5):1918-56. https://doi.org/10.1111/rssc.12599.
Neal, RM. 2000. “Markov Chain Sampling Methods for Dirichlet Process Mixture Models.”
Journal of Computational and Graphical Statistics 9 (2):249 https://doi.org/10.2307/1390653.
Usage
PolyaUrnBivarDirichlet(
rc_determinations,
rc_sigmas,
calibration_curve,
F14C_inputs = FALSE,
n_iter = 1e+05,
n_thin = 10,
use_F14C_space = TRUE,
slice_width = NA,
slice_multiplier = 10,
n_clust = min(10, length(rc_determinations)),
show_progress = TRUE,
sensible_initialisation = TRUE,
lambda = NA,
nu1 = NA,
nu2 = NA,
A = NA,
B = NA,
alpha_shape = NA,
alpha_rate = NA,
mu_phi = NA,
calendar_ages = NA
)Arguments
- rc_determinations
A vector of observed radiocarbon determinations. Can be provided either as \({}^{14}\)C ages (in \({}^{14}\)C yr BP) or as F\({}^{14}\)C concentrations.
- rc_sigmas
A vector of the (1-sigma) measurement uncertainties for the radiocarbon determinations. Must be the same length as
rc_determinationsand given in the same units.- calibration_curve
A dataframe which must contain one column
calendar_age_BP, and also columnsc14_ageandc14_sigorf14candf14c_sig(or both sets). This format matches the curves supplied with this package, e.g., intcal20, intcal13, which contain all 5 columns.- F14C_inputs
TRUEif the providedrc_determinationsare F\({}^{14}\)C concentrations andFALSEif they are radiocarbon ages. Defaults toFALSE.- n_iter
The number of MCMC iterations (optional). Default is 100,000.
- n_thin
How much to thin the MCMC output (optional). Will store every
n_thin\({}^\textrm{th}\) iteration. 1 is no thinning, while a larger number will result in more thinning. Default is 10. Must choose an integer greater than 1. Overall number of MCMC realisations stored will be \(n_{\textrm{out}} = \textrm{floor}( n_{\textrm{iter}}/n_{\textrm{thin}}) + 1\) so do not choosen_thintoo large to ensure there are enough samples from the posterior to use for later inference.- use_F14C_space
If
TRUE(default) the calculations within the function are carried out in F\({}^{14}\)C space. IfFALSEthey are carried out in \({}^{14}\)C age space. We recommend selectingTRUEas, for very old samples, calibrating in F\({}^{14}\)C space removes the potential affect of asymmetry in the radiocarbon age uncertainty. Note: This flag can be set independently of the format/scale on whichrc_determinationswere originally provided.- slice_width
Parameter for slice sampling (optional). If not given a value is chosen intelligently based on the spread of the initial calendar ages. Must be given if
sensible_initialisationisFALSE.- slice_multiplier
Integer parameter for slice sampling (optional). Default is 10. Limits the slice size to
slice_multiplier * slice_width.- n_clust
The number of clusters with which to initialise the sampler (optional). Must be less than the length of
rc_determinations. Default is 10 or the length ofrc_determinationsif that is less than 10.- show_progress
Whether to show a progress bar in the console during execution. Default is
TRUE.- sensible_initialisation
Whether to use sensible values to initialise the sampler and an automated (adaptive) prior on \(\mu_{\phi}\) and (A, B) that is informed by the observed
rc_determinations. If this isTRUE(the recommended default), then all the remaining arguments below are ignored.- lambda, nu1, nu2
Hyperparameters for the prior on the mean \(\phi_j\) and precision \(\tau_j\) of each individual calendar age cluster \(j\): $$(\phi_j, \tau_j)|\mu_{\phi} \sim \textrm{NormalGamma}(\mu_{\phi}, \lambda, \nu_1, \nu_2)$$ where \(\mu_{\phi}\) is the overall cluster centering. Required if
sensible_initialisationisFALSE.- A, B
Prior on \(\mu_{\phi}\) giving the mean and precision of the overall centering \(\mu_{\phi} \sim N(A, B^{-1})\). Required if
sensible_initialisationisFALSE.- alpha_shape, alpha_rate
Shape and rate hyperparameters that specify the prior for the Dirichlet Process (DP) concentration, \(\alpha\). This concentration \(\alpha\) determines the number of clusters we expect to observe among our \(n\) sampled objects. The model places a prior on \(\alpha \sim \Gamma(\eta_1, \eta_2)\), where \(\eta_1, \eta_2\) are the
alpha_shapeandalpha_rate. A small \(\alpha\) means the DPMM is more concentrated (i.e. we expect fewer calendar age clusters) while a large alpha means it is less less concentrated (i.e. many clusters). Required ifsensible_initialisationisFALSE.- mu_phi
Initial value of the overall cluster centering \(\mu_{\phi}\). Required if
sensible_initialisationisFALSE.- calendar_ages
The initial estimate for the underlying calendar ages (optional). If supplied, it must be a vector with the same length as
rc_determinations. Required ifsensible_initialisationisFALSE.
Value
A list with 10 items. The first 8 items contain output of the model, each of which has one dimension of size \(n_{\textrm{out}} = \textrm{floor}( n_{\textrm{iter}}/n_{\textrm{thin}}) + 1\). The rows in these items store the state of the MCMC from every \(n_{\textrm{thin}}\)\({}^\textrm{th}\) iteration:
cluster_identifiersA list of length \(n_{\textrm{out}}\) each entry gives the cluster allocation (an integer between 1 and
n_clust) for each observation on the relevant MCMC iteration. Information on the state of these calendar age clusters (means and precisions) can be found in the other output items.alphaA double vector of length \(n_{\textrm{out}}\) giving the Dirichlet Process concentration parameter \(\alpha\).
n_clustAn integer vector of length \(n_{\textrm{out}}\) giving the current number of clusters in the model.
phiA list of length \(n_{\textrm{out}}\) each entry giving a vector of length
n_clustof the means of the current calendar age clusters \(\phi_j\).tauA list of length \(n_{\textrm{out}}\) each entry giving a vector of length
n_clustof the precisions of the current calenadar age cluster \(\tau_j\).observations_per_clusterA list of length \(n_{\textrm{out}}\) each entry giving a vector of length
n_clustof the number of observations for that cluster.calendar_agesAn \(n_{\textrm{out}}\) by \(n_{\textrm{obs}}\) integer matrix. Gives the current estimate for the calendar age of each individual observation.
mu_phiA vector of length \(n_{\textrm{out}}\) giving the overall centering \(\mu_{\phi}\) of the calendar age clusters.
where \(n_{\textrm{obs}}\) is the number of radiocarbon observations i.e.
the length of rc_determinations.
The remaining items give information about the input data, input parameters (or
those calculated using sensible_initialisation) and the update_type
update_typeA string that always has the value "Polya Urn".
input_dataA list containing the \({}^{14}\)C data used, and the name of the calibration curve used.
input_parametersA list containing the values of the fixed hyperparameters
lambda,nu1,nu2,A,B,alpha_shape,alpha_rateandmu_phi, and the slice parametersslice_widthandslice_multiplier.
See also
WalkerBivarDirichlet for our less-preferred MCMC method to update the Bayesian DPMM
(otherwise an identical model); and PlotCalendarAgeDensityIndividualSample,
PlotPredictiveCalendarAgeDensity and PlotNumberOfClusters
to access the model output and estimate the calendar age information.
See also PPcalibrate for an an alternative (similarly rigorous) approach to
calibration and summarisation of related radiocarbon determinations using a variable-rate Poisson process
Examples
# Note these examples are shown with a small n_iter to speed up execution.
# When you run ensure n_iter gives convergence (try function default).
# Basic usage making use of sensible initialisation to set most values and
# using a saved example data set and the IntCal20 curve.
polya_urn_output <- PolyaUrnBivarDirichlet(
two_normals$c14_age,
two_normals$c14_sig,
intcal20,
n_iter = 100,
show_progress = FALSE)
# The radiocarbon determinations can be given as F14C concentrations
polya_urn_output <- PolyaUrnBivarDirichlet(
two_normals$f14c,
two_normals$f14c_sig,
intcal20,
F14C_inputs = TRUE,
n_iter = 100,
show_progress = FALSE)