The Cannon
The Cannon is a data-driven spectral modeling pipeline that estimates stellar labels from APOGEE spectra using a second-order polynomial model trained on a reference set of labeled spectra.
What it does
The Cannon estimates stellar parameters and chemical abundances from APOGEE spectra:
- Stellar parameters: Teff, log g, microturbulence (v_micro), macroturbulence (v_macro)
- Metallicity: [Fe/H]
- Chemical abundances (as [X/Fe]): [C/Fe], [N/Fe], [O/Fe], [Na/Fe], [Mg/Fe], [Al/Fe], [Si/Fe], [S/Fe], [K/Fe], [Ca/Fe], [Ti/Fe], [V/Fe], [Cr/Fe], [Mn/Fe], [Ni/Fe]

It operates on APOGEE coadded spectra (`ApogeeCoaddedSpectrumInApStar`), visit spectra (`ApogeeVisitSpectrumInApStar`), and combined spectra (`ApogeeCombinedSpectrum`).
How it works
The generative model
The Cannon is a data-driven model that learns a mapping from stellar labels to spectra using a training set of spectra with known labels. At each pixel, the flux is modeled as a second-order polynomial function of the labels:
f(labels) = theta_0 + theta_1 * L1 + theta_2 * L1^2 + theta_3 * L2 + theta_4 * L1*L2 + theta_5 * L2^2 + ...
This includes a bias term, linear terms, quadratic terms, and cross-terms for all label pairs.
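As a minimal sketch of the per-pixel polynomial (not the pipeline's own code), the snippet below builds the vector of bias, linear, and second-order terms for one label vector and evaluates the model flux as a dot product with `theta`. The function names `design_vector` and `predict_pixel_flux` are hypothetical, and the ordering of the terms is an arbitrary choice that may differ from the ordering used by the pipeline.

```python
import numpy as np
from itertools import combinations_with_replacement

def design_vector(labels):
    """Return [1, linear terms, all second-order products] for one label vector."""
    labels = np.asarray(labels, dtype=float)
    # Every pair (i, j) with i <= j gives one quadratic or cross term.
    quadratic = [labels[i] * labels[j]
                 for i, j in combinations_with_replacement(range(labels.size), 2)]
    return np.concatenate(([1.0], labels, quadratic))

def predict_pixel_flux(theta, labels):
    """Evaluate the per-pixel polynomial: theta dotted with the design vector."""
    return float(np.dot(theta, design_vector(labels)))

# Two labels give 1 + 2 + 3 = 6 coefficients per pixel.
vec = design_vector([2.0, 3.0])
print(vec)
```

For K labels the vector has (K + 1)(K + 2) / 2 entries, so the number of coefficients per pixel grows quadratically with the number of labels.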
Training
The training step fits the model coefficients (theta) at each pixel independently:
1. Labels are normalized to zero mean and unit variance.
2. A design matrix is constructed from the normalized labels (including all second-order terms).
3. The coefficients are fit by (optionally regularized) linear regression using scikit-learn’s `LinearRegression` or `Lasso`.
4. A model variance (s^2) is computed at each pixel to account for model inadequacy (the difference between the model predictions and the training data beyond what is explained by observational noise).
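The training steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: it uses the unregularized `LinearRegression` path only, builds the design matrix with scikit-learn's `PolynomialFeatures`, and estimates the model variance with a simple assumed estimator (residual variance in excess of the mean noise variance, clipped at zero). The function name `train_sketch` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def train_sketch(labels, flux, ivar):
    """labels: (n_stars, n_labels); flux and ivar: (n_stars, n_pixels)."""
    # Step 1: normalize labels to zero mean and unit variance.
    mean, std = labels.mean(axis=0), labels.std(axis=0)
    normalized = (labels - mean) / std
    # Step 2: design matrix with bias, linear, quadratic, and cross terms.
    design = PolynomialFeatures(degree=2, include_bias=True).fit_transform(normalized)
    # Step 3: ordinary least squares; each pixel (output column) is fit independently.
    model = LinearRegression(fit_intercept=False).fit(design, flux)
    theta = model.coef_  # shape (n_pixels, n_terms)
    # Step 4 (assumed estimator): residual variance beyond the mean noise variance.
    residuals = flux - design @ theta.T
    noise_var = np.mean(1.0 / ivar, axis=0)
    s2 = np.clip(residuals.var(axis=0) - noise_var, 0.0, None)
    return mean, std, theta, s2

# Synthetic, noise-free training set: 200 stars, 3 labels, 5 identical pixels.
rng = np.random.default_rng(0)
demo_labels = rng.normal(size=(200, 3))
demo_flux = (1.0 + 0.1 * demo_labels[:, [0]] * demo_labels[:, [1]]
             + 0.05 * demo_labels[:, [2]] ** 2)
demo_flux = np.repeat(demo_flux, 5, axis=1)
demo_ivar = np.full_like(demo_flux, 1.0e4)
mean, std, theta, s2 = train_sketch(demo_labels, demo_flux, demo_ivar)
print(theta.shape, s2)
```

Because the synthetic flux is exactly quadratic in the labels, the fit reproduces it and the excess variance is zero at every pixel. Replacing `LinearRegression` with `Lasso` would give the regularized variant mentioned in step 3.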
Inference (test step)
For each test spectrum:
1. The spectrum is continuum-normalized. Two methods are used depending on the spectrum type:
   - For `ApogeeCombinedSpectrum`: the continuum is pre-computed and stored with the spectrum.
   - For `ApogeeCoaddedSpectrumInApStar` and visit spectra: NMF (non-negative matrix factorization) continuum normalization is used via the `NMFRectify` pipeline’s stored continuum parameters.
2. The labels are optimized using `scipy.optimize.curve_fit`, minimizing the chi-squared between the observed and model spectra. The total inverse variance used for weighting includes both the observational noise and the model variance: `adjusted_ivar = ivar / (1 + ivar * s2)`
3. Multiple initial guesses are tried (zeros, +1, -1, and a linear algebra estimate), and the one with the lowest chi-squared is used.
4. Uncertainties are estimated from the covariance matrix of the fit.
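The optimization step can be sketched with `scipy.optimize.curve_fit` as below. This is a hedged illustration rather than the pipeline's code: the helper names (`quadratic_design`, `fit_labels`), the result dictionary, and the synthetic coefficients are hypothetical; labels are assumed to be in normalized units; and the linear algebra initial guess is omitted because its exact form is not specified here.

```python
import numpy as np
from scipy.optimize import curve_fit

def quadratic_design(x):
    """Bias, linear, and all second-order terms for a normalized label vector."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(x.size)
    return np.concatenate(([1.0], x, x[i] * x[j]))

def fit_labels(flux, ivar, theta, s2, initial_guesses):
    """Try each initial guess and keep the fit with the lowest chi-squared."""
    adjusted_ivar = ivar / (1.0 + ivar * s2)  # observational noise plus model variance
    sigma = 1.0 / np.sqrt(adjusted_ivar)
    pixels = np.arange(flux.size)             # dummy x data required by curve_fit

    def model(_, *x):
        return theta @ quadratic_design(x)

    best = None
    for index, x0 in enumerate(initial_guesses):
        try:
            popt, pcov = curve_fit(model, pixels, flux, p0=x0,
                                   sigma=sigma, absolute_sigma=True)
        except RuntimeError:                  # this trial did not converge
            continue
        chi2 = float(np.sum(adjusted_ivar * (flux - model(pixels, *popt)) ** 2))
        if best is None or chi2 < best["chi2"]:
            best = {"x": popt, "e_x": np.sqrt(np.diag(pcov)),
                    "chi2": chi2, "x0_index": index}
    return best

# Synthetic two-label model with 50 pixels and a noise-free spectrum.
rng = np.random.default_rng(1)
theta = rng.normal(scale=0.1, size=(50, 6))
theta[:, 0] = 1.0
truth = np.array([0.3, -0.2])
flux = theta @ quadratic_design(truth)
ivar = np.full(50, 1.0e4)
s2 = np.zeros(50)
guesses = [np.zeros(2), np.ones(2), -np.ones(2)]
result = fit_labels(flux, ivar, theta, s2, guesses)
print(result["x"], result["x0_index"])
```

Passing `absolute_sigma=True` makes the returned covariance reflect the supplied uncertainties directly, which is what the formal (raw) label uncertainties are derived from in this sketch.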
Noise model
A post-hoc noise model correction is applied to the formal uncertainties using empirical calibration from TheCannon_corrections.pkl:
e_label = scale * raw_e_label + offset
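A small sketch of applying this linear correction is shown below. The scale and offset values are made up for illustration, and the dictionary layout is an assumption: the actual structure of TheCannon_corrections.pkl is not described here.

```python
import numpy as np

# Hypothetical calibration values for illustration only; the real values come
# from TheCannon_corrections.pkl, whose layout is not documented in this page.
corrections = {"teff": {"scale": 1.5, "offset": 10.0}}

def corrected_uncertainty(raw_e_label, scale, offset):
    """Apply e_label = scale * raw_e_label + offset."""
    return scale * np.asarray(raw_e_label, dtype=float) + offset

e_teff = corrected_uncertainty([20.0, 40.0], **corrections["teff"])
print(e_teff)
```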
Output fields
Stellar parameters
| Field | Type | Description |
|---|---|---|
| | float | Effective temperature (K) |
| | float | Uncertainty in Teff |
| | float | Surface gravity (log10(cm/s^2)) |
| | float | Uncertainty in log g |
| | float | Metallicity [Fe/H] (dex) |
| | float | Uncertainty in [Fe/H] |
| | float | Microturbulent velocity (km/s) |
| | float | Uncertainty in v_micro |
| | float | Macroturbulent velocity (km/s) |
| | float | Uncertainty in v_macro |
Chemical abundances
Chemical abundances are reported as [X/Fe] (relative to iron):
| Fields | Element |
|---|---|
| | Carbon |
| | Nitrogen |
| | Oxygen |
| | Sodium |
| | Magnesium |
| | Aluminum |
| | Silicon |
| | Sulfur |
| | Potassium |
| | Calcium |
| | Titanium |
| | Vanadium |
| | Chromium |
| | Manganese |
| | Nickel |

All labels also have raw_e_* counterparts storing the formal uncertainties before noise model correction.
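Because abundances are reported relative to iron, comparing with [X/H] values requires adding [Fe/H]. The sketch below shows that conversion. The function name `x_over_h` and the example numbers are hypothetical, and the uncertainty is combined in quadrature under the assumption that the two errors are uncorrelated, which ignores any covariance between labels.

```python
import math

def x_over_h(x_fe, fe_h, e_x_fe, e_fe_h):
    """Return ([X/H], uncertainty) from [X/Fe] and [Fe/H]."""
    value = x_fe + fe_h  # [X/H] = [X/Fe] + [Fe/H]
    # Assumes uncorrelated errors; covariance between the labels is ignored.
    error = math.hypot(e_x_fe, e_fe_h)
    return value, error

mg_h, e_mg_h = x_over_h(x_fe=0.10, fe_h=-0.50, e_x_fe=0.03, e_fe_h=0.04)
print(mg_h, e_mg_h)
```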
Fit quality and metadata
| Field | Type | Description |
|---|---|---|
| | float | Chi-squared of the best fit |
| | float | Reduced chi-squared |
| | int | Integer flag from |
| | int | Number of function evaluations during optimization |
| | int | Index of the initial guess trial that produced the best chi-squared |
| | bitmask | Bitfield encoding quality flags |
Spectral data
| Field | Type | Description |
|---|---|---|
| | array | Wavelength array (log-lambda spaced, 8575 pixels) |
| | array | Best-fit model flux (continuum x rectified model) |
| | array | Fitted continuum used for normalization |

These spectral data arrays are stored in intermediate pickle files and loaded on demand.
Flags
| Flag | Bit | Description |
|---|---|---|
| | 2^0 | The fitting procedure failed |
Summary flags
`flag_bad`: Equivalent to `flag_fitting_failure`.
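The flag definitions above amount to testing bit 0 of the quality bitfield. A minimal sketch, assuming the bitfield is available as a plain integer (the argument name `result_flags` is hypothetical, since the bitfield's field name is not given here):

```python
FLAG_FITTING_FAILURE = 2 ** 0  # bit 0: the fitting procedure failed

def flag_fitting_failure(result_flags):
    """True if bit 0 of the (hypothetically named) bitfield is set."""
    return bool(result_flags & FLAG_FITTING_FAILURE)

def flag_bad(result_flags):
    """Summary flag; equivalent to flag_fitting_failure."""
    return flag_fitting_failure(result_flags)

print(flag_bad(0), flag_bad(1))
```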
Caveats
- The Cannon is a data-driven method. Its accuracy is fundamentally limited by the quality and coverage of its training set. Results for stars outside the training set’s label space (the convex hull of the training labels) should be treated with extra caution.
- Chemical abundances are reported as [X/Fe] (relative to iron), not [X/H] (relative to hydrogen), unlike The Payne and AstroNN.
- The model assumes a second-order polynomial relationship between labels and spectra at each pixel. This can limit accuracy for labels that have more complex spectral signatures.
- Continuum normalization is a critical step. For coadded and visit spectra, the pipeline requires pre-computed NMF continuum parameters from the `NMFRectify` pipeline. Spectra without successful NMF rectification are excluded from processing.
- The model variance (s^2) term accounts for model inadequacy but can downweight informative pixels if the training set has large scatter at those pixels.
- The `x0_index` field indicates which initial guess trial was used (0 = zeros, 1 = +1, 2 = -1, 3 = linear algebra estimate). If the best fit frequently comes from the +1 or -1 trials rather than the linear algebra estimate, it may indicate the model is struggling with certain types of spectra.
- The model and continuum spectral arrays are stored as intermediate pickle files organized by source primary key, not in the database directly.
- The formal uncertainties from `curve_fit` tend to underestimate true uncertainties. The post-hoc noise model correction addresses this.
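One way to act on the first caveat is to test whether a star's fitted labels lie inside the convex hull of the training labels. The sketch below poses this as a linear feasibility problem with `scipy.optimize.linprog`, which avoids constructing the hull explicitly in a high-dimensional label space. It assumes the training labels are available as an array, which this page does not confirm, and the function name `in_convex_hull` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, training_labels):
    """True if `point` is a convex combination of the rows of `training_labels`."""
    n_train = training_labels.shape[0]
    # Look for weights w >= 0 with sum(w) = 1 and training_labels.T @ w = point.
    a_eq = np.vstack([training_labels.T, np.ones((1, n_train))])
    b_eq = np.append(point, 1.0)
    result = linprog(c=np.zeros(n_train), A_eq=a_eq, b_eq=b_eq, bounds=(0, None))
    return result.status == 0  # 0 means a feasible set of weights was found

# Toy two-label training set: the corners of the unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
inside = in_convex_hull(np.array([0.5, 0.5]), corners)
outside = in_convex_hull(np.array([1.5, 0.5]), corners)
print(inside, outside)
```

A point that fails this check is an extrapolation of the model, so its labels deserve the extra caution described above.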