{ "cells": [ { "cell_type": "markdown", "id": "556c8ddd", "metadata": {}, "source": [ "## Distribution of TAD strength scores\n", "Using some advanced techniques, it's possible to calculate an arbitrary score for each snippet that contributed to the final pileup, and save those values within the pileup dataframe. This can be used to investigate whether the contribution features are all similar, or only some outliers cause enrichment.\n", "\n", "Here as an example, we can calculate and store the TAD strength for each TAD that was averaged." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# If you are a developer, you may want to reload the packages on a fly. \n", "# Jupyter has a magic for this particular purpose:\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "id": "f39f911c", "metadata": {}, "outputs": [], "source": [ "# import standard python libraries\n", "import matplotlib as mpl\n", "%matplotlib inline\n", "mpl.rcParams['figure.dpi'] = 96\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 3, "id": "f39f911c", "metadata": {}, "outputs": [], "source": [ "# import libraries for biological data analysis\n", "from coolpuppy import coolpup\n", "from coolpuppy.lib.numutils import get_domain_score\n", "from coolpuppy.lib.puputils import accumulate_values\n", "from coolpuppy import plotpup\n", "import cooler\n", "import bioframe\n", "import cooltools\n", "from cooltools.lib import io\n", "from cooltools import insulation, expected_cis\n", "from cooltools.lib import plotting" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Downloading test data for pileups\n", "# cache = True will doanload the data only if it was not previously downloaded\n", "# data_dir=\"./\" will force download to the current directory\n", "cool_file = cooltools.download_data(\"HFF_MicroC\", cache=True, data_dir='./')\n", "# Open cool file with Micro-C data:\n", "clr = cooler.Cooler(f'{cool_file}::/resolutions/10000')\n", "# Set up selected data resolution:\n", "resolution = 10000" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Use bioframe to fetch the genomic features from the UCSC.\n", "hg38_chromsizes = bioframe.fetch_chromsizes('hg38')\n", "hg38_cens = bioframe.fetch_centromeres('hg38')\n", "hg38_arms = bioframe.make_chromarms(hg38_chromsizes, hg38_cens)\n", "\n", "# Select only chromosomes that are present in the cooler. \n", "# This step is typically not required! we call it only because the test data are reduced. \n", "hg38_arms = hg38_arms.set_index(\"chrom\").loc[clr.chromnames].reset_index()\n", "# call this to automaticly assign names to chromosomal arms:\n", "hg38_arms = bioframe.make_viewframe(hg38_arms)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Calculate expected interactions for chromosome arms\n", "expected = expected_cis(\n", " clr,\n", " ignore_diags=0,\n", " view_df=hg38_arms,\n", " chunksize=1000000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to generate coordinates of TADs. It's quite simple using cooltools.insulation: we get coordinates of strongly insulating regions, which likely correspond to TAD boundaries. Then we just need to combine consecutive boundaries, filter out super long domains, and we have a list of TAD coordiantes." ] }, { "cell_type": "code", "execution_count": 7, "id": "b5075dd8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | chrom | \n", "start | \n", "end | \n", "region | \n", "is_bad_bin | \n", "log2_insulation_score_100000 | \n", "n_valid_pixels_100000 | \n", "boundary_strength_100000 | \n", "is_boundary_100000 | \n", "
|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "chr2 | \n", "0 | \n", "10000 | \n", "chr2_p | \n", "True | \n", "NaN | \n", "0.0 | \n", "NaN | \n", "False | \n", "
| 1 | \n", "chr2 | \n", "10000 | \n", "20000 | \n", "chr2_p | \n", "False | \n", "0.692051 | \n", "8.0 | \n", "NaN | \n", "False | \n", "
| 2 | \n", "chr2 | \n", "20000 | \n", "30000 | \n", "chr2_p | \n", "False | \n", "0.760561 | \n", "17.0 | \n", "NaN | \n", "False | \n", "
| 3 | \n", "chr2 | \n", "30000 | \n", "40000 | \n", "chr2_p | \n", "False | \n", "0.766698 | \n", "27.0 | \n", "NaN | \n", "False | \n", "
| 4 | \n", "chr2 | \n", "40000 | \n", "50000 | \n", "chr2_p | \n", "False | \n", "0.674906 | \n", "37.0 | \n", "NaN | \n", "False | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| 32541 | \n", "chr17 | \n", "83210000 | \n", "83220000 | \n", "chr17_q | \n", "True | \n", "NaN | \n", "0.0 | \n", "NaN | \n", "False | \n", "
| 32542 | \n", "chr17 | \n", "83220000 | \n", "83230000 | \n", "chr17_q | \n", "True | \n", "NaN | \n", "0.0 | \n", "NaN | \n", "False | \n", "
| 32543 | \n", "chr17 | \n", "83230000 | \n", "83240000 | \n", "chr17_q | \n", "True | \n", "NaN | \n", "0.0 | \n", "NaN | \n", "False | \n", "
| 32544 | \n", "chr17 | \n", "83240000 | \n", "83250000 | \n", "chr17_q | \n", "True | \n", "NaN | \n", "0.0 | \n", "NaN | \n", "False | \n", "
| 32545 | \n", "chr17 | \n", "83250000 | \n", "83257441 | \n", "chr17_q | \n", "True | \n", "NaN | \n", "0.0 | \n", "NaN | \n", "False | \n", "
32552 rows × 9 columns
\n", "