Write me a script to determine sample cross-contamination from an exome sequencing experiment using only .vcf file output in Python

To determine sample cross-contamination from an exome sequencing experiment using only a .vcf output file, you can follow these steps:

  1. Parse the VCF file using a VCF parser library such as PyVCF.
  2. Filter out variants that are shared by multiple samples or that match the reference in every sample.
  3. Generate a list of "informative" variants that are polymorphic and private to a single sample.
  4. Calculate the variant (ALT) read frequency of the informative variants in each sample.
  5. Estimate contamination as the deviation of these frequencies from the expected ratios (about 0.5 for heterozygous calls and 1.0 for homozygous-alternate calls).
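
As a rough illustration of steps 2 to 5, here is a minimal sketch that uses only the standard library on a multi-sample VCF. It assumes biallelic sites and a FORMAT column containing `GT` and `AD` (per-allele read depths); the function names `parse_vcf_lines` and `estimate_contamination` and the `min_depth` threshold are hypothetical choices for this example, not part of any library. A site counts as informative when exactly one sample (the "donor") is called `1/1` and every other sample is called `0/0`; the pooled ALT read fraction in each other sample then serves as a crude estimate of how much donor material it contains.

```python
from collections import defaultdict


def parse_vcf_lines(lines):
    """Yield (chrom, pos, {sample: (genotype, ref_depth, alt_depth)}) per usable site."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if "," in fields[4]:  # skip multiallelic sites to keep the arithmetic simple
            continue
        keys = fields[8].split(":")
        if "GT" not in keys or "AD" not in keys:
            continue
        calls = {}
        for name, raw in zip(samples, fields[9:]):
            values = dict(zip(keys, raw.split(":")))
            genotype = values.get("GT", "./.").replace("|", "/")
            depths = values.get("AD", "").split(",")
            if len(depths) != 2 or not all(d.isdigit() for d in depths):
                continue
            calls[name] = (genotype, int(depths[0]), int(depths[1]))
        if len(calls) == len(samples):  # require usable depths in every sample
            yield fields[0], int(fields[1]), calls


def estimate_contamination(lines, min_depth=20):
    """Return {(donor, recipient): pooled ALT read fraction at donor-private sites}."""
    alt_reads = defaultdict(int)
    total_reads = defaultdict(int)
    for _chrom, _pos, calls in parse_vcf_lines(lines):
        carriers = [s for s, (gt, _, _) in calls.items() if gt != "0/0"]
        # Informative site: one hom-alt donor, all other samples called 0/0.
        if len(carriers) != 1 or calls[carriers[0]][0] != "1/1":
            continue
        donor = carriers[0]
        for recipient, (_gt, ref_depth, alt_depth) in calls.items():
            depth = ref_depth + alt_depth
            if recipient == donor or depth < min_depth:
                continue
            alt_reads[(donor, recipient)] += alt_depth
            total_reads[(donor, recipient)] += depth
    return {pair: alt_reads[pair] / total_reads[pair] for pair in total_reads}
```

A call such as `estimate_contamination(open("your_file.vcf"))` would return one fraction per (donor, recipient) pair. Sequencing error also produces stray ALT reads, and a heavily contaminated sample may no longer be called `0/0` at these sites, so the values are best read as a screening signal rather than a calibrated estimate.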

Here's a code snippet that demonstrates how to parse a VCF file using the PyVCF library.

main.py
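import vcf

# Parse the VCF file; the context manager closes the handle when the loop ends
with open("your_file.vcf", "r") as handle:
    vcf_file = vcf.Reader(handle)

    # Access variants using the `vcf_file` object
    for record in vcf_file:
        # Do your variant filtration and calculations here;
        # printing the basic fields keeps the loop body valid in the meantime
        print(record.CHROM, record.POS, record.REF, record.ALT)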
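
For a single-sample VCF, one possible calculation to place in that loop is the fraction of reference reads at homozygous-alternate calls, where the expected ALT fraction is close to 1.0. The sketch below is deliberately independent of any parser: it takes plain `(genotype, ref_depth, alt_depth)` tuples, and the function name `hom_alt_ref_fraction` and the `min_depth` threshold are hypothetical choices for this example.

```python
def hom_alt_ref_fraction(calls, min_depth=20):
    """Pooled fraction of REF reads at homozygous-ALT calls.

    `calls` is an iterable of (genotype, ref_depth, alt_depth) tuples.
    Returns None when no call passes the genotype and depth filters.
    """
    ref_reads = 0
    total_reads = 0
    for genotype, ref_depth, alt_depth in calls:
        depth = ref_depth + alt_depth
        # Keep only confident hom-alt calls; phased "1|1" is treated like "1/1".
        if genotype.replace("|", "/") != "1/1" or depth < min_depth:
            continue
        ref_reads += ref_depth
        total_reads += depth
    return ref_reads / total_reads if total_reads else None
```

With PyVCF, the tuples could be built from `call['GT']` and `call['AD']` for each `call` in `record.samples`, assuming the variant caller wrote an `AD` field. The result mixes contamination with sequencing error, and a contaminant that itself carries the ALT allele contributes no reference reads, so the value tends to understate the true contamination level.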

The detailed steps and code for further processing can be found in the tutorial "Estimating cross-sample contamination in exome sequencing data".

gistlib by LogSnag