normalize

Left-align indels, split multi-allelic sites into biallelic records, and optionally validate REF alleles against a reference FASTA.

vcfkit normalize [OPTIONS] --reference <FASTA> [INPUT]

Input defaults to stdin if not provided. Output defaults to stdout.

Flag                      Description
-f, --reference <FASTA>   Reference genome FASTA (required)
-o, --output <FILE>       Output file (default: stdout)
--no-split                Keep multi-allelic sites (don’t split)
--no-left-align           Skip left-alignment of indels
--check-ref <MODE>        How to handle REF mismatches: ignore / warn (default) / error
--fast                    Enable fast path for biallelic SNPs/MNPs (~4× faster)
-q, --quiet               Suppress progress bar and stats

# Standard: left-align + split multi-allelic
vcfkit normalize -f hg38.fa input.vcf > normalized.vcf
# Keep multi-allelic sites (normalize in place, don't split)
vcfkit normalize -f hg38.fa --no-split input.vcf > normalized.vcf
# Fast path (biallelic SNPs/MNPs only — indels use standard path)
vcfkit normalize --fast -f hg38.fa input.vcf > normalized.vcf
# Error on REF mismatch (strict mode)
vcfkit normalize -f hg38.fa --check-ref error input.vcf > normalized.vcf
# From stdin, to file
bcftools view input.bcf | vcfkit normalize -f hg38.fa -o normalized.vcf

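Left-alignment implements the algorithm of Tan et al. (2015): while REF and ALT end in the same base, that base is trimmed from both alleles, and whenever trimming leaves an allele empty, both alleles are extended one base to the left using the reference. Once the alleles no longer share a trailing base, shared leading bases are trimmed for as long as both alleles are longer than one base.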
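
To make the shifting loop concrete, here is a minimal Python sketch of the normalization procedure as published by Tan et al. (2015). It is an illustration, not vcfkit's code: the function name `normalize_biallelic` and the use of an in-memory string for the reference are assumptions made for the example.

```python
def normalize_biallelic(ref_seq: str, pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Left-align and trim one biallelic variant.

    ref_seq is the reference sequence of the contig (hypothetical in-memory
    string); pos is the 1-based VCF POS of the first base of REF.
    """
    if ref == alt:
        # Not a real variant; nothing to normalize.
        return pos, ref, alt

    changed = True
    while changed:
        changed = False
        # Trim a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        # At pos == 1 there is nothing to extend with, so the loop stops.
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Deleting "CA" from the repeat in GCACA, written at POS 3, shifts to POS 1.
print(normalize_biallelic("GCACA", 3, "ACA", "A"))  # (1, 'GCA', 'G')
```

The same deletion can be written at several positions inside the repeat; the loop keeps shifting until the alleles stop sharing a trailing base, which is what makes the result unique.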

When a record has multiple ALT alleles (e.g., REF=A ALT=T,C), each ALT is written as a separate biallelic record. INFO fields with Number=A (one value per ALT allele) are sliced to the value for that ALT. Number=R fields (one value per allele, including REF) are reduced to the REF value plus the value for that ALT. Number=1 and Number=. fields are copied verbatim.
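
The slicing rules can be sketched in a few lines of Python. This is an illustrative sketch only: `split_info`, the dictionary layout, and the example field names are assumptions, and it assumes each Number=R field lists the REF value first, as the VCF specification requires.

```python
def split_info(
    info: dict[str, list[str]],
    number: dict[str, str],
    n_alt: int,
) -> list[dict[str, list[str]]]:
    """Return one INFO dict per ALT allele.

    info maps each INFO key to its list of values; number maps each key to
    the Number declared in the header ("A", "R", "1", ".", ...).
    """
    records = []
    for i in range(n_alt):
        rec = {}
        for key, values in info.items():
            kind = number.get(key, ".")
            if kind == "A":
                # One value per ALT: keep only this ALT's value.
                rec[key] = [values[i]]
            elif kind == "R":
                # One value per allele: keep REF (index 0) and this ALT.
                rec[key] = [values[0], values[i + 1]]
            else:
                # Everything else is copied unchanged.
                rec[key] = list(values)
        records.append(rec)
    return records


# REF=A ALT=T,C with hypothetical AC (Number=A), AD (Number=R), DP (Number=1).
info = {"AC": ["3", "5"], "AD": ["10", "3", "5"], "DP": ["18"]}
number = {"AC": "A", "AD": "R", "DP": "1"}
for rec in split_info(info, number, n_alt=2):
    print(rec)
```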

The fast path reads raw VCF lines. For biallelic SNPs and MNPs — the majority of records in 1000 Genomes-style VCFs — it writes them as raw bytes without full noodles serialization. On SNP-heavy VCFs, this is ~4× faster than the standard path.

Multi-allelic records and indels (when left-alignment is enabled, i.e. --no-left-align is not set) fall back to the full noodles pipeline. The --fast flag is opt-in because it changes the code path; use the differential tests to verify behavior on your data.
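
As a rough illustration of the kind of check a fast path needs, the Python sketch below decides from a raw VCF line whether a record is a biallelic SNP or MNP. How vcfkit classifies records internally is not documented here; `is_fast_path_candidate` and its exact rules (equal-length alleles made only of A/C/G/T/N) are assumptions for the example.

```python
_BASES = set("ACGTNacgtn")


def is_fast_path_candidate(line: str) -> bool:
    """Return True if a raw VCF data line is a biallelic SNP or MNP."""
    if line.startswith("#"):
        return False  # header lines are never records
    # Only the first five columns are needed: CHROM POS ID REF ALT.
    fields = line.rstrip("\n").split("\t", 5)
    if len(fields) < 5:
        return False
    ref, alt = fields[3], fields[4]
    if "," in alt:
        return False  # multi-allelic: needs the full pipeline
    if not ref or len(ref) != len(alt):
        return False  # indel (or malformed): needs the full pipeline
    # Rejects symbolic alleles such as <DEL>, '*' and the missing value '.'.
    return set(ref) <= _BASES and set(alt) <= _BASES


print(is_fast_path_candidate("chr1\t100\t.\tA\tT\t.\t.\t."))    # True
print(is_fast_path_candidate("chr1\t100\t.\tA\tT,C\t.\t.\t."))  # False
```

Splitting only the leading columns avoids parsing INFO and sample fields, which is the general idea behind passing such records through as raw bytes.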

vcfkit command                   bcftools equivalent
normalize -f ref.fa              bcftools norm -f ref.fa -m-any -c w
normalize -f ref.fa --no-split   bcftools norm -f ref.fa -c w

Multi-allelic indels are currently passed through unchanged rather than left-aligned. Biallelic left-alignment is fully implemented; joint multi-allelic left-alignment requires the full multi-ALT extension of Tan et al. (2015), which is planned for v0.2.

See Known differences for details.

By default (--check-ref warn), vcfkit emits a warning to stderr for each record where the first base of REF doesn’t match the reference FASTA at that position. Only the first base is checked — consistent with bcftools behavior.

  • warn — log the mismatch, continue (default)
  • error — abort on first mismatch
  • ignore — skip all REF checking (fastest; useful when you trust your VCF)
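
The three modes can be pictured with a small Python sketch. It is not vcfkit's implementation: `check_ref`, `RefMismatchError`, the in-memory reference string, and the case-insensitive comparison are all assumptions made for illustration.

```python
import sys


class RefMismatchError(Exception):
    """Raised in 'error' mode when REF disagrees with the reference."""


def check_ref(ref_allele: str, ref_seq: str, pos: int, mode: str = "warn") -> bool:
    """Compare the first base of REF with the reference at 1-based pos.

    Returns True when the record passes (or checking is disabled).
    """
    if mode == "ignore":
        return True  # skip all REF checking
    if mode not in ("warn", "error"):
        raise ValueError(f"unknown check-ref mode: {mode!r}")

    expected = ref_seq[pos - 1]
    # Only the first base is compared; case-insensitivity is an assumption.
    if ref_allele[:1].upper() == expected.upper():
        return True

    msg = f"REF mismatch at position {pos}: VCF has {ref_allele[:1]!r}, reference has {expected!r}"
    if mode == "error":
        raise RefMismatchError(msg)  # abort on first mismatch
    print(f"warning: {msg}", file=sys.stderr)  # log and continue
    return False


print(check_ref("A", "ACGT", 1))            # True
print(check_ref("T", "ACGT", 1, "ignore"))  # True
```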