Why synthetic data
Industrial EM labs don't need to label more data — they need a simulator that matches their microscope.
The binding constraint in industrial electron microscopy is the scarcity of labelled data: real-world datasets are proprietary, expensive to annotate, and often held under NDA. NeuralSoftX's signature approach is physics-based synthetic data generation: every simulated training image is the output of a full forward model, from electron–specimen scattering through to the detector recording.
Electron–specimen physics
Multislice simulation via MULTEM covers elastic and inelastic scattering, frozen-lattice thermal diffuse scattering (essential for quantitative HAADF-STEM), and full aberration-function modelling. Scattering potentials use the Lobato–Van Dyck 2014 parameterisation (Acta Crystallographica A) — roughly an order of magnitude more accurate than Doyle–Turner or Kirkland parameterisations across the periodic table.
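To make the multislice idea concrete, here is a minimal NumPy sketch of the textbook transmit-and-propagate loop. It is not MULTEM's API or implementation: the function names, arguments, and units are assumptions chosen for illustration, and the sketch leaves out thermal diffuse scattering, inelastic scattering, aberrations, and the anti-aliasing band limit that a production code applies.

```python
import numpy as np

def electron_wavelength_angstrom(energy_ev):
    """Relativistic electron wavelength in angstrom for a beam energy in eV."""
    hc = 12398.42          # h*c in eV*angstrom
    m0c2 = 510998.95       # electron rest energy in eV
    return hc / np.sqrt(energy_ev * (2.0 * m0c2 + energy_ev))

def multislice(psi0, projected_potentials, sigma, wavelength, dz, pixel_size):
    """Propagate a wave through a stack of thin slices (illustrative only).

    psi0                 : complex (ny, nx) incident wave
    projected_potentials : iterable of real (ny, nx) arrays, in V*angstrom
    sigma                : interaction parameter, in rad / (V*angstrom)
    wavelength, dz, pixel_size : in angstrom
    """
    ny, nx = psi0.shape
    kx = np.fft.fftfreq(nx, d=pixel_size)   # spatial frequencies, 1/angstrom
    ky = np.fft.fftfreq(ny, d=pixel_size)
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    # Fresnel free-space propagator over one slice thickness dz
    propagator = np.exp(-1j * np.pi * wavelength * dz * k2)

    psi = psi0.astype(np.complex128)
    for vz in projected_potentials:
        psi = psi * np.exp(1j * sigma * vz)                  # transmit (phase object)
        psi = np.fft.ifft2(np.fft.fft2(psi) * propagator)    # propagate to next slice
    return psi
```

Because each slice is a pure phase object and the propagator has unit modulus, this sketch conserves total intensity; absorptive or inelastic channels would have to be added explicitly.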
Detector recording physics
The detector forward model is instrument-specific, not a generic Gaussian. It covers the point-spread function (PSF), modulation transfer function (MTF), detective quantum efficiency (DQE(0) and DQE(ν)), Poisson electron shot noise, Gaussian readout noise, gain and dark reference, and — for STEM — fast-scan and slow-scan distortion following Ophus, Ciston & Nelson (Ultramicroscopy 2016).
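A simplified sketch of such a recording chain is shown below, assuming a single-stage model: Poisson shot noise on the expected electron counts, blurring by an MTF, then gain, dark offset, and Gaussian readout noise. The function names and parameters are hypothetical, and the Gaussian MTF is only a stand-in for a measured, instrument-specific curve. Scan distortion and frequency-dependent DQE are not modelled here.

```python
import numpy as np

def gaussian_mtf(shape, psf_sigma_px):
    """Placeholder MTF: Fourier transform of a Gaussian PSF (sigma in pixels).

    A real pipeline would substitute the measured MTF of the detector.
    """
    ky = np.fft.fftfreq(shape[0])            # cycles per pixel
    kx = np.fft.fftfreq(shape[1])
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    return np.exp(-2.0 * np.pi ** 2 * psf_sigma_px ** 2 * k2)

def simulate_detector(intensity, dose_per_pixel, mtf,
                      gain=1.0, dark=0.0, read_sigma=0.0, rng=None):
    """Turn an ideal image into a noisy recorded frame (illustrative only).

    intensity      : non-negative (ny, nx) ideal image, 1.0 = full beam
    dose_per_pixel : mean incident electrons per pixel at intensity 1.0
    mtf            : (ny, nx) array in FFT layout, with mtf[0, 0] == 1
    """
    rng = np.random.default_rng() if rng is None else rng
    expected = intensity * dose_per_pixel
    counts = rng.poisson(expected).astype(np.float64)        # electron shot noise
    blurred = np.fft.ifft2(np.fft.fft2(counts) * mtf).real   # detector blur
    readout = rng.normal(0.0, read_sigma, size=intensity.shape)
    return gain * blurred + dark + readout                   # gain, dark, readout
```

Applying the blur after the Poisson draw makes the recorded noise spatially correlated, which is one reason a white-Gaussian noise model tends to mismatch real frames.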
Domain randomisation
The simulator randomises dose, defocus, spherical aberration, PSF width, gain, and contamination over broad distributions so that networks trained on synthetic data generalise to real experimental micrographs. Sim-only training is used where ground-truth labels cannot be obtained from experimental images (e.g. defect detection, atom segmentation); pretrain-then-finetune is used where a small amount of experimental data is available.
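As an illustration of domain randomisation, the sketch below draws one set of imaging parameters per training image. The field names, units, and numeric ranges are placeholders invented for this example, not the values used in the actual pipeline; in practice the ranges would be chosen to bracket the target instrument.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ImagingParams:
    dose_e_per_A2: float           # electron dose, e-/angstrom^2
    defocus_A: float               # defocus, angstrom
    cs_mm: float                   # spherical aberration, mm
    psf_sigma_px: float            # detector PSF width, pixels
    gain: float                    # detector gain, counts per electron
    contamination_strength: float  # relative contamination level, 0..1

# Hypothetical ranges, for illustration only.
DOSE_RANGE = (1e2, 1e5)
DEFOCUS_RANGE = (-500.0, 500.0)
CS_RANGE = (0.0, 1.2)
PSF_RANGE = (0.3, 2.0)
GAIN_RANGE = (0.5, 2.0)
CONTAMINATION_RANGE = (0.0, 1.0)

def sample_params(rng):
    """Draw one random parameter set; dose is sampled log-uniformly."""
    log_lo, log_hi = np.log(DOSE_RANGE[0]), np.log(DOSE_RANGE[1])
    return ImagingParams(
        dose_e_per_A2=float(np.exp(rng.uniform(log_lo, log_hi))),
        defocus_A=float(rng.uniform(*DEFOCUS_RANGE)),
        cs_mm=float(rng.uniform(*CS_RANGE)),
        psf_sigma_px=float(rng.uniform(*PSF_RANGE)),
        gain=float(rng.uniform(*GAIN_RANGE)),
        contamination_strength=float(rng.uniform(*CONTAMINATION_RANGE)),
    )
```

Dose is sampled on a log scale because it spans several orders of magnitude; a uniform draw would rarely produce low-dose examples. Passing a seeded `np.random.default_rng(seed)` makes the generated dataset reproducible.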
Published, validated on real data
This is not a theoretical pitch. The full pipeline was published and validated on experimental images in Lobato, Friedrich & Van Aert, npj Computational Materials 10, 10 (2024) — a deep CNN for single-shot electron-microscopy restoration trained purely on MULTEM-simulated data.