The Cannon¶

The Cannon is a data-driven spectral modeling pipeline that estimates stellar labels from APOGEE spectra using a second-order polynomial model trained on a reference set of labeled spectra.

What it does¶

The Cannon estimates stellar parameters and chemical abundances from APOGEE spectra:

Stellar parameters: Teff, log g, microturbulence (v_micro), macroturbulence (v_macro)
Metallicity: [Fe/H]
Chemical abundances (as [X/Fe]): [C/Fe], [N/Fe], [O/Fe], [Na/Fe], [Mg/Fe], [Al/Fe], [Si/Fe], [S/Fe], [K/Fe], [Ca/Fe], [Ti/Fe], [V/Fe], [Cr/Fe], [Mn/Fe], [Ni/Fe]

It operates on APOGEE coadded spectra (ApogeeCoaddedSpectrumInApStar), visit spectra (ApogeeVisitSpectrumInApStar), and combined spectra (ApogeeCombinedSpectrum).

How it works¶

The generative model¶

The Cannon is a data-driven model that learns a mapping from stellar labels to spectra using a training set of spectra with known labels. At each pixel, the flux is modeled as a second-order polynomial function of the labels:

f(labels) = theta_0 + theta_1 * L1 + theta_2 * L1^2 + theta_3 * L2 + theta_4 * L1*L2 + theta_5 * L2^2 + ...

This includes a bias term, linear terms, quadratic terms, and cross-terms for all label pairs.

Training¶

The training step fits the model coefficients (theta) at each pixel independently:

Labels are normalized to zero mean and unit variance.
A design matrix is constructed from the normalized labels (including all second-order terms).
The coefficients are fit by (optionally regularized) linear regression using scikit-learn’s LinearRegression or Lasso.
A model variance (s^2) is computed at each pixel to account for model inadequacy (the difference between the model predictions and the training data beyond what is explained by observational noise).

Inference (test step)¶

For each test spectrum:

The spectrum is continuum-normalized. Two methods are used depending on the spectrum type:
- For ApogeeCombinedSpectrum: the continuum is pre-computed and stored with the spectrum.
- For ApogeeCoaddedSpectrumInApStar and visit spectra: NMF (non-negative matrix factorization) continuum normalization is used via the NMFRectify pipeline’s stored continuum parameters.
The labels are optimized using scipy.optimize.curve_fit, minimizing the chi-squared between the observed and model spectra. The total inverse variance used for weighting includes both the observational noise and the model variance:
```
adjusted_ivar = ivar / (1 + ivar * s2)
```
Multiple initial guesses are tried (zeros, +1, -1, and a linear algebra estimate), and the one with the lowest chi-squared is used.
Uncertainties are estimated from the covariance matrix of the fit.

Noise model¶

A post-hoc noise model correction is applied to the formal uncertainties using empirical calibration from TheCannon_corrections.pkl:

e_label = scale * raw_e_label + offset

Output fields¶

Stellar parameters¶

Field	Type	Description
`teff`	float	Effective temperature (K)
`e_teff`	float	Uncertainty in Teff
`logg`	float	Surface gravity (log10(cm/s^2))
`e_logg`	float	Uncertainty in log g
`fe_h`	float	Metallicity [Fe/H] (dex)
`e_fe_h`	float	Uncertainty in [Fe/H]
`v_micro`	float	Microturbulent velocity (km/s)
`e_v_micro`	float	Uncertainty in v_micro
`v_macro`	float	Macroturbulent velocity (km/s)
`e_v_macro`	float	Uncertainty in v_macro

Chemical abundances¶

Chemical abundances are reported as [X/Fe] (relative to iron):

Fields	Element
`c_fe`, `e_c_fe`	Carbon
`n_fe`, `e_n_fe`	Nitrogen
`o_fe`, `e_o_fe`	Oxygen
`na_fe`, `e_na_fe`	Sodium
`mg_fe`, `e_mg_fe`	Magnesium
`al_fe`, `e_al_fe`	Aluminum
`si_fe`, `e_si_fe`	Silicon
`s_fe`, `e_s_fe`	Sulfur
`k_fe`, `e_k_fe`	Potassium
`ca_fe`, `e_ca_fe`	Calcium
`ti_fe`, `e_ti_fe`	Titanium
`v_fe`, `e_v_fe`	Vanadium
`cr_fe`, `e_cr_fe`	Chromium
`mn_fe`, `e_mn_fe`	Manganese
`ni_fe`, `e_ni_fe`	Nickel

All labels also have raw_e_* counterparts storing the formal uncertainties before noise model correction.

Fit quality and metadata¶

Field	Type	Description
`chi2`	float	Chi-squared of the best fit
`rchi2`	float	Reduced chi-squared
`ier`	int	Integer flag from `scipy.optimize.curve_fit` indicating solver status
`nfev`	int	Number of function evaluations during optimization
`x0_index`	int	Index of the initial guess trial that produced the best chi-squared
`result_flags`	bitmask	Bitfield encoding quality flags

Spectral data¶

Field	Type	Description
`wavelength`	array	Wavelength array (log-lambda spaced, 8575 pixels)
`model_flux`	array	Best-fit model flux (continuum x rectified model)
`continuum`	array	Fitted continuum used for normalization

These spectral data arrays are stored in intermediate pickle files and loaded on demand.

Flags¶

Flag	Bit	Description
`flag_fitting_failure`	2^0	The fitting procedure failed

Summary flags¶

flag_bad: Equivalent to flag_fitting_failure.

Caveats¶

The Cannon is a data-driven method. Its accuracy is fundamentally limited by the quality and coverage of its training set. Results for stars outside the training set’s label space (the convex hull of the training labels) should be treated with extra caution.
Chemical abundances are reported as [X/Fe] (relative to iron), not [X/H] (relative to hydrogen), unlike The Payne and AstroNN.
The model assumes a second-order polynomial relationship between labels and spectra at each pixel. This can limit accuracy for labels that have more complex spectral signatures.
Continuum normalization is a critical step. For coadded and visit spectra, the pipeline requires pre-computed NMF continuum parameters from the NMFRectify pipeline. Spectra without successful NMF rectification are excluded from processing.
The model variance (s^2) term accounts for model inadequacy but can downweight informative pixels if the training set has large scatter at those pixels.
The x0_index field indicates which initial guess trial was used (0 = zeros, 1 = +1, 2 = -1, 3 = linear algebra estimate). If the best fit frequently comes from the +1 or -1 trials rather than the linear algebra estimate, it may indicate the model is struggling with certain types of spectra.
The model and continuum spectral arrays are stored as intermediate pickle files organized by source primary key, not in the database directly.
The formal uncertainties from curve_fit tend to underestimate true uncertainties. The post-hoc noise model correction addresses this.