##############################################################
# Methodologies

Full details of the methods used to create the datasets are provided in:

Pink, C.J., Swaminathan, S.K., Dunham, I., Rogers, J., Ward, A., and Hurst, L.D. (2009). Evidence that replication-associated mutation alone does not explain between-chromosome differences in substitution rates. Genome Biology and Evolution. 1(1): 13-22

Pink, C.J. and Hurst, L.D. (2010). Timing of replication is a determinant of neutral substitution rates but does not explain slow Y chromosome evolution in rodents. Molecular Biology and Evolution. 27(5): 1077-1086

##############################################################
# Scripts

Two Tcl script libraries are provided. For both, the main script mm_rn_Ki_RT_(with/no)_filter.tcl controls procedures in the associated script library in folder mm-rn_Ki_RT_(with/no)_filter_library. Both must be located within the same folder. Note that file and directory naming reflects the two sets of scripts that are provided:

Scipts_With_selection_filter: contains scripts to create the filtered dataset that underpins the main findings of Pink and Hurst (2010). These scripts include the runs test described in Pink et al. (2009) in the filter for introns under selection.

Scripts_No_selection_filter: contains scripts to create the unfiltered dataset that supports the supplementary findings. In these scripts the runs test is commented out of the alignment parameter script.

The scripts depend on a working installation of the LAGAN alignment software (Brudno, M., Do, C. B., Cooper, G. M., Kim, M. F., Davydov, E., Program, N. C. S., Green, E. D., Sidow, A. & Batzoglou, S. (2003). LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Research, 13:721-731). LAGAN was obtained from http://lagan.stanford.edu/lagan_web/index.shtml during October 2007.

In order to correctly assign replication times to intronic substitution rates, mouse genomic positions must be converted from the July 2007 assembly (mm9) to the February 2006 assembly (mm8). To do this, the command line version of the University of California Santa Cruz (UCSC) LiftOver tool must be installed, with the associated chain file mm9ToMm8.over.chain. LiftOver and the mm9ToMm8.over.chain file were obtained from http://genome.ucsc.edu/cgi-bin/hgLiftOver during May 2009.

Intronic substitution rates were calculated using the knc.pl script. This script was written by Prof Dr Martin Lercher. All other scripts were written by Dr Catherine Pink.

##############################################################
# Input files

The following input files are the source data that the scripts published here process. Input files must be stored in a directory named Input_Files located in the same folder as the main mm_rn_Ki_RT_(with/no)_filter.tcl script and the appropriate script library.

These input data files were obtained from publicly accessible repositories, as listed below. Where possible, citations and source URLs are provided, together with approximate dates of access. No guarantee is provided that these source data remain available. Individual data repositories should be contacted for access.

Mouse_Jul2007_Exin.tfa - Mouse genome build mm9 obtained from UCSC Table Browser during January 2008: Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D493-6. Table Browser options: Assembly: July 2007 (NCBI37/mm9); Group: Genes and Gene Predictions; Track: RefSeq Genes; Output format: genomic sequence; Sequence Retrieval Region Options: Introns, one FASTA per region, 0 extra 5' bases, 0 extra 3' bases. Exon CDS: Uppercase; Introns: lowercase. Note that the file name was chosen by the data creator and is required for correct function of the accompanying scripts.

Rat_Nov2004_EXin.tfa - Rat genome assembly Nov 2004 (Baylor 3.4/rn4) obtained from UCSC Table Browser during November 2007, with settings as for the mouse genome.

HMD_Rat5.rpt - Citation: Eppig, J. T., J. A. Blake, C. J. Bult, J. A. Kadin, J. E. Richardson, and the Mouse Genome Database Group. 2007. The mouse genome database (MGD): new features facilitating a model system. Nucleic Acids Res. 35:630-637. Obtained from http://informatics.jax.org/ February 2007.

mm9ToMm8.over.chain - Chain file for use with UCSC command line liftOver tool, obtained from http://hgdownload.cse.ucsc.edu/goldenPath/mm9/liftOver/ during May 2009.

mouse.rna.gbff - Obtained from NCBI during May 2009.

rat.rna.gbff - Obtained from NCBI during May 2009.

NM_011667.txt - Obtained from UCSC Table Browser using the settings as for the mouse genome but pasting NM_011667 for the identifiers (names/accessions). Obtained from http://genome.ucsc.edu/cgi-bin/hgTables May 2009.

mmu_rbo_introns_Y.tab - This file contains introns from the Y-linked Jarid1d, Eif2s3y and Ube1y rat genes aligned with the mouse orthologues NM_011419, NM_012011 and NM_011667 respectively. Formatting of this file was done locally, prior to sequences being deposited in GenBank. As the published scripts are dependent on the formatting of this file, the original input file used for the analysis is provided here.

Zfy_int_aln_ed.tfa - This FASTA file contains the final intron of the rat Zfy gene aligned with the mouse orthologue NM_009570. The rat annotation line was written locally and, as the published scripts are dependent on its structure, the original input file used for the analysis is provided here.

RD_46CESCave_Sm300_081128.txt - Obtained from www.replicationdomain.org April 2009. See Hiratani, I., T. Ryba, M. Itoh, T. Yokochi, M. Schwaiger, C. W. Chang, Y. Lyou, T. M. Townes, D. Schübeler, and D. M. Gilbert. 2008. Global reorganization of replication domains during embryonic stem cell differentiation. PLoS Biol 6:e245.

RD_D3ESCave_Sm300_081128.txt - Obtained from www.replicationdomain.org April 2009. See Hiratani, I., T. Ryba, M. Itoh, T. Yokochi, M. Schwaiger, C. W. Chang, Y. Lyou, T. M. Townes, D. Schübeler, and D. M. Gilbert. 2008. Global reorganization of replication domains during embryonic stem cell differentiation. PLoS Biol 6:e245.

RD_iPSave_Sm300_081128.txt - Obtained from www.replicationdomain.org April 2009. See Hiratani, I., T. Ryba, M. Itoh, T. Yokochi, M. Schwaiger, C. W. Chang, Y. Lyou, T. M. Townes, D. Schübeler, and D. M. Gilbert. 2008. Global reorganization of replication domains during embryonic stem cell differentiation. PLoS Biol 6:e245.

RD_TT2ESCave_Sm300_081128.txt - Obtained from www.replicationdomain.org April 2009. See Hiratani, I., T. Ryba, M. Itoh, T. Yokochi, M. Schwaiger, C. W. Chang, Y. Lyou, T. M. Townes, D. Schübeler, and D. M. Gilbert. 2008. Global reorganization of replication domains during embryonic stem cell differentiation. PLoS Biol 6:e245.

##############################################################
# Output Files

Both sets of scripts create and store intermediate files to enable checking, including extracted sequence files, pairs of orthologous files, aligned sequences and filtered sequences.

The final datasets are created in tab-separated .txt files and .xls files. These files are named K_gene_RepTime_genic_GT.txt and K_gene_RepTime_genic_GT.xls. For clarity of publication, these files have been renamed as appropriate:

* mm_rn_Ki_RT_dataset_with_filter.txt
* mm_rn_Ki_RT_dataset_no_filter.txt

For preservation purposes, .xlsx and .csv versions of the datasets have been created manually (they are not output by the published scripts).

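As a convenience for readers who want to inspect the tab-separated datasets outside the published Tcl scripts, the following is a minimal Python sketch. It is not part of the published scripts. It assumes the .txt file has a single header row whose column names match the dataset variables listed in this README. The function names read_dataset and rows_by_class are hypothetical, and the values in the sample are invented placeholders, not real data.

```python
import csv
import io


def read_dataset(handle):
    """Read a tab-separated dataset from an open text handle into a list of dicts.

    Assumes the first row is a header naming the dataset variables.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader]


def rows_by_class(rows):
    """Group rows by the Class column (X, Y or A)."""
    groups = {}
    for row in rows:
        groups.setdefault(row["Class"], []).append(row)
    return groups


# Invented placeholder rows, used only to show the expected layout.
sample = io.StringIO(
    "Ortholog\tClass\tK_TN\tMean_RT\n"
    "1\tA\t0.15\t0.42\n"
    "2\tX\t0.13\t-0.30\n"
)
rows = read_dataset(sample)
groups = rows_by_class(rows)

# With a real file, something like this should work under the same assumptions:
# with open("mm_rn_Ki_RT_dataset_with_filter.txt", newline="") as fh:
#     rows = read_dataset(fh)
```

All values are read as strings, so numeric columns such as K_TN or Mean_RT need converting with float() before analysis.
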
##############################################################
# Dataset variables

Ortholog - Internal reference number to enable tracking genes through data processing
Mouse_Chromosome - Mouse chromosomal location
Rat_Chromosome - Rat chromosomal location
Mouse_Refseq - Mouse Refseq ID
Rat_Refseq - Rat Refseq ID
Introns_Concatenated - Number of introns concatenated
Mouse_Start_mm9 - 5' end of the coding sequence for the mouse orthologous gene based on the mm9 assembly.
Mouse_End_mm9 - 3' end of the coding sequence for the mouse orthologous gene based on the mm9 assembly.
Mouse_Start_mm8 - Location of the 5' end of the coding sequence for the mouse orthologous gene on the mm8 assembly, converted from the mm9 position using the LiftOver tool.
Mouse_End_mm8 - Location of the 3' end of the coding sequence for the mouse orthologous gene on the mm8 assembly, converted from the mm9 position using the LiftOver tool.
Rat_Start - 5' end of the coding sequence for the rat orthologous gene based on the rn4 assembly.
Rat_End - 3' end of the coding sequence for the rat orthologous gene based on the rn4 assembly.
Alignment_Length - Length of the aligned intronic sequence in base pairs
N - Number of base pairs over which the intronic substitution rate is calculated (should equal alignment length)
K_JC - Intronic substitution rate correcting for multiple hits according to the method of Jukes, T. H. and Cantor, C. R. (1969) Evolution of protein molecules. IN MUNRO, H. N. (Ed.) Mammalian protein metabolism. New York, Academic.
K_Kimura - Intronic substitution rate correcting for multiple hits according to Kimura's 2 parameter model: Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111-120.
K_TN - Intronic substitution rate correcting for multiple hits according to the method of Tamura, K. and Nei, M.
(1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol, 10, 512-526.
K_TK - Intronic substitution rate correcting for multiple hits according to the method of Tamura, K. and Kumar, S. (2002) Evolutionary distance estimation under heterogeneous substitution pattern among lineages. Mol Biol Evol, 19, 1727-1736.
raw_subst - Raw intronic substitution rate of the aligned sequence with no correction for multiple hits
raw_subst_per_site - Raw intronic substitution rate per base pair of aligned sequence
GC - GC content for the aligned intronic sequence
Class - Type of chromosome: X, Y or A (Autosome)
Number_Replication_Regions - Number of replication probes averaged based on overlap with any part of the gene.
Mean_RT - Mean of all replication time probes from the Hiratani dataset that overlap with any part of the gene, based on the 5' and 3' ends of the coding sequence
Median_RT - Median of all replication time probes from the Hiratani dataset that overlap with any part of the gene, based on the 5' and 3' ends of the coding sequence
Mean_RT_start - Mean replication time taken from the four replicate dataset probes located closest to the 5' end of the gene
Mean_RT_end - Mean replication time taken from the four replicate dataset probes located closest to the 3' end of the gene
Mean_RT_difference - Difference in mean replication times between the 5' and 3' ends of the gene (used to compare variation across genes to variation across chromosomes)
Median_RT_start - Median replication time taken from the four replicate dataset probes located closest to the 5' end of the gene
Median_RT_end - Median replication time taken from the four replicate dataset probes located closest to the 3' end of the gene
Median_RT_difference - Difference in median replication times between the 5' and 3' ends of the gene (used to compare variation across genes to variation across chromosomes)
A - Number of adenine bases in the original mouse intronic sequence for the gene.
T - Number of thymine bases in the original mouse intronic sequence for the gene.
G - Number of guanine bases in the original mouse intronic sequence for the gene.
C - Number of cytosine bases in the original mouse intronic sequence for the gene.
N - Number of bases in the original mouse intronic sequence for the gene.
GT_SKEW - Extent of GT skew in the mouse intronic sequence, determined by ((G + T) - (A + C)) / (G + T + A + C)
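
To illustrate two of the variables above, the following Python sketch recomputes GT_SKEW from the A, T, G and C counts using the formula given for that variable, and applies the standard Jukes-Cantor correction to a raw per-site substitution proportion. It is not taken from the published scripts. The function names gt_skew and jukes_cantor are hypothetical, and the knc.pl script used to produce K_JC is not reproduced here, so its handling of gaps and ambiguous bases may differ from this sketch.

```python
import math


def gt_skew(a, t, g, c):
    """GT skew as defined in this README: ((G + T) - (A + C)) / (G + T + A + C)."""
    total = g + t + a + c
    if total == 0:
        raise ValueError("no A, T, G or C bases counted")
    return ((g + t) - (a + c)) / total


def jukes_cantor(p):
    """Standard Jukes-Cantor distance for a raw per-site difference proportion p.

    K = -(3/4) * ln(1 - (4/3) * p), defined only for 0 <= p < 0.75.
    Assumption: this is the textbook formula, not a copy of knc.pl.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Invented counts: 10 A, 30 T, 30 G, 10 C gives (60 - 20) / 80.
example_skew = gt_skew(10, 30, 30, 10)  # 0.5
example_k = jukes_cantor(0.1)           # slightly larger than 0.1
```

The corrected distance always exceeds the raw proportion for p > 0, because the correction accounts for multiple substitutions at the same site.
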