![]() |
|
|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BIOBASE GmbH, Wolfenbüttel, Germany (A.K., V.M., E.W.); Institute of Cytology and Genetics, Novosibirsk, Russia (A.K.); Fraunhofer Institute of Toxicology and Experimental Medicine, Center for Drug Research and Medical Biotechnology, Hannover, Germany (S.R., J.B.); National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina (P.N.); and Center of Pharmacology and Toxicology, Medical School of Hannover, Hannover, Germany (J.B.)
Received April 21, 2004; accepted August 26, 2004
| Abstract |
|---|
|
|
|---|
The aryl hydrocarbon receptor (AhR) is a well understood ligand-activated transcription factor that mediates responses to a variety of toxins. Among them are halogenated aromatic toxins such as 2,3,7,8-tetrachlorodibenzo-p-dioxin, polynuclear aromatic hydrocarbons, combustion products, and numerous phytochemicals such as flavonoids and indole-3-carbinol. The nuclear AhR complex is a heterodimer composed of the AhR and AhR nuclear translocator (Arnt) proteins. Binding of this complex to so-called dioxin responsive elements leads to activation of a large number of AhR-responsive genes, including phase 1 (e.g., CYP1A1) and phase 2 drug-metabolizing enzymes and also genes coding for cellular differentiation, metabolism, and apoptosis among others (Safe and Wormke, 2003
).
In-depth studies of the last years revealed many details about the mechanisms of action of the AhR transcription factor and promoter activation of target genes, including CYP1A1 (Whitlock, 1999
). Protein-DNA interactions of AhR were analyzed, and expression of target genes was studied. Despite our advanced understanding of AhR receptor biology, the precise mechanism of activation of target genes remains uncertain. There are a number of genes known to be regulated (activated or repressed) by AhR in combination with other transcription factors, including AP-1 and ER (Safe and Wormke, 2003
), but the general combinatoric rules to enable robust prediction of gene expression are lacking. Furthermore, there is a need to identify networks of cooperating transcription factors. Thus, computational approaches for the prediction of the transcriptional network of AhR target genes are needed.
In the past few years, a number of computational approaches were developed addressing the problem of combinatorial regulation of transcription by networks of transcription factors. Specific TF binding site combinations were used for an identification of muscle-specific promoters (Frech et al., 1998
; Wasserman and Fickett, 1998
), promoters of liverenriched genes (Tronche et al., 1997
), of yeast genes (Brazma et al.,1997
), of immune-specific genes (Kel et al., 1993
; Boehlk et al., 2000
; Fessele et al., 2001
), and promoters of genes regulated during cell cycle (Kel et al., 2003
). In the database TRANSCompel (Kel-Margoulis et al., 2002a
), known composite regulatory elements and specific combinations of pairs or triplets of TF binding sites that are located in proximity to each other were collected. This provides information on synergistic or antagonistic effects on gene regulation by different TFs. Nonetheless, information on TF networks is limited, and descriptions of such networks are often highly speculative. With the introduction of global gene expression arrays into the armory of biomedical research, investigators have been enabled to analyze the expression patterns of hundreds or thousands of genes altered in parallel in response to physiological as well as pathological stimuli. To gain insight into the mechanisms governing these changes, we need to acquire a better understanding of the regulatory proteins as well as the regulatory regions of the genes involved. To facilitate the expansion of this knowledge base, one suitable approach is to develop genetic algorithms for the identification and characterization of promoters and other regulatory elements. Including such genetic algorithms, we have developed a computational strategy that we used to analyze promoters of differentially expressed genes regulated by the AhR. Applying this strategy, we identified composite modules that include AhR binding sites arranged in clusters and accompanied by a number of colocalized specific DNA motifs, including hepatic nuclear factors, CCAAT/enhancer-binding proteins, and others. The revealed composite modules were used to identify novel gene targets based on whole genome searches.
| Materials and Methods |
|---|
|
|
|---|
RNA Isolation and Production of Copy RNA. Total RNA was isolated using RNeasy total RNA isolation kit (QIAGEN GmbH, Hilden, Germany) according to the manufacturer's recommendation. Ten micrograms of total RNA was used for the synthesis of double-stranded cDNA with Superscript II reverse transcriptase. Primer extension was with high-performance liquid chromatography-purified T7-(dT)24 (GenSet SA, Paris, France) as a primer. After clean up, the double-stranded cDNA was used for the synthesis of biotin-labeled cRNA (ENZO BioArray High Yield RNA Transcript Labeling kit; Affymetrix, Santa Clara, CA). cRNA was purified with RNeasy spin columns from QIAGEN and cleaved into fragments of 35 to 200 bases by metal-induced hydrolysis.
Microarray Experiments. This was done according to the manufacturer's recommendations as detailed in Singh-Gasson et al. (1999
) with n = 3 repeats. Furthermore, we used n = 15 cDNA arrays containing 302 well known genes that code for detoxification, cell proliferation, tumor development, heat shock response, signal transduction, apoptosis, and cell cycle regulation (J. Borlak, J. Drewes, K. Hofmann, and A. Bosio, submitted).
Thermocycler RT-PCR. For PCR amplification of cDNA, a 25-µl reaction mixture was prepared containing 10x polymerase reaction buffer, 1.5 mM MgCl2, 0.4 nM dNTPs (Roche Diagnostics, Mannheim, Germany), 400 nM concentration of the 3'- and 5'-specific primers (synthesized by Invitrogen, Karlsruhe, Germany), 1 U of Taq-polymerase, and 1 µl of cDNA.
PCR reactions were carried out in a thermal cycler (T3; Biometra, Göttingen, Germany) using primer-specific melting, annealing, and extension cycling conditions. DNA contamination was checked for by direct amplification of RNA extracts before conversion of RNA to cDNA and was excluded. PCRs were done within the linear range of amplification. PCR products were separated using an agarose gel, stained with ethidium bromide, and photographed on a transilluminator. A semiquantitative measurement was performed using the program NIH Image version 1.62.
Sequence Retrieval. One hundred eighteen (68 human and 50 rat) differentially expressed genes were used as training sets (Table 1). For 41 human and 37 rat genes, the level of expression was determined experimentally.
|
The Ensembl database and UCSC Genome browser was used to extract the 5' regions of the AhR-regulated genes and to build the training sets of sequences. For the majority of the genes and based on the Ensembl annotation of the first exons, 5' regions could be retrieved. The beginning of the first exon was considered as a tentative transcription start site (TSS). However, for some genes, identification of TSS was difficult. In these cases, several possible 5' regions were considered. For some of genes given in Table 1, the annotation of promoters in genomes was not reliable (based on mapping of 5' incomplete expressed sequence tags or based on an in silico gene prediction). In such cases, we did not include these genes into the training sets. Finally, 96 5' regions of these genes were included into the training set (81 human and 16 rat). Four sets of subsequences were prepared: (-500/+100), (-1000/+100), (-2000/+2000), and (-20,000-gene-+20,000). We refer to these sequences as G600, G1100, G4000, and G20gene20kb sets.
Another training subset contained 5' upstream regions for 58 genes (42 human and 16 rat) for which gene expression was quantified (Table 1). We used the logarithms of the relative expression values for further studies of correlation between promoter structure and the relative expression values. Furthermore, several sequence sets were extracted from the human genome database and used as controls. One set contained 200 promoter sequences from human genes in chromosome 21 (PR). Another set contained far upstream sequences of human genes (based on RefSeq annotation) from chromosome 21 (from -50 to -40 kb) (UPG10kb set). Another set was used for an analysis of exon 3 sequences of human genes (EXON3 set). No functional TF binding sites are within exon 3; thus, these sequences can be used as a TF "site-free" background. A further sequence set containing randomly selected intergenic sequences of chromosome 22 (INTERG set) were also used and considered in our analysis.
Databases. Two databases for prediction of gene expression were used (BIOBASE GmbH, Wolfenbüttel, Germany). TRANSFAC is a database that collects information about gene regulation in eukaryotes based on binding of transcription factors to their target sites. We used TRANSFAC Professional release 6.4 and TRANSCompel (release 6.4), which contains known composite regulatory elements in mammalian genes.
Weight Matrix Method for Recognition of AhR Sites. The most widely used methods for recognition of transcription factor binding sites is the application of positional weight matrices (PWMs) (Quandt et al., 1995
; Whitlock, 1999
). TRANSFAC is the largest collection of weight matrices for eukaryotic transcription factors (Wingender et al., 2001
) (http://www.biobase.de and http://www.gene-regulation.com). In this database, three different weight matrices for AhR sites were stored before this study (Accession nos. M00139, M00235, and M00237). In the current work, a novel weight matrix was constructed on the basis of 25 known, experimentally verified, genomic AhR-binding sites (TRANSFAC accession no. M00778; id V$AHR_Q5). All these matrices were used for searching potential AhR sites in genomic sequences. For this search, we used the MATCH algorithm calculating scores for the matches by applying the so-called information vector (Kel et al., 2003
).
Software Tools for Searching TF Binding Sites. We used the TRANSPLORER software package (BIOBASE GmbH) to identify potential TF binding sites. This software uses a PWM collection and the most up-to-date library of matrices derived from the TRANSFAC Professional database. A TF binding site search can be made using all or a particular subset of matrices from this library. Such matrix subsets with defined cutoff values for every matrix are called "profiles" in TRANSPLORER.
We used a profile that included matrices for different transcription factors of vertebrate organisms (TRANSFAC release 6.4) known to be involved in gene regulation in liver, in immune cells, and in the cell cycle. The cutoff values for the matrices are set to a very low value to minimize the rate of false negative predictions.
Method for Revealing Short Sequence Motifs in the Flanking Regions of TF Sites (Local Context). To analyze the flanking regions of AhR sites, we applied an algorithm that was developed recently (Kel et al., 1999
, 2001
). The composition of over- and under-represented short oligonucleotides (local context) is investigated in the flanking regions of functional AhR sites (50 bp to the left and right of the 11-bp-long AhR sites). This local context is taken into account while searching genomes for potential AhR sites.
The applied algorithm is based on a comparison of two sets of sequences of equal length L: a training set Y consisting of the functional AhR sites, including their flanking regions; and a negative control set N consisting of the sequences from "nonpromoter" regions of the human genome. To compile the N set, we computed a MATCH search in the INTERG sets using the AhR matrix (accession no. M00778; score value q > qcutoff = 0.98, it gives approximately 1 match per 5 kb of genomic sequence, as it was tested on the chromosome 22). One hundred of randomly selected matches together with their flanks were placed into the control set N. Thus, set N consists of sequences that contain a motif fitting to the AhR matrix; however, because of its position in the genome most of these "sites" should not be functional. This way, by comparison of sets Y and N, we can reveal features that characterize the sequence environment (context) of functional AhR sites. As contextual features, we consider the frequency of occurrence of short motifs
= (a1a2..ak) (a
{A,T,G,C, W,S,R,Y,M,K,B,V,H,D,N}1) of the length k
4 in a window w = [t1,t2] (0 < t2 < t1 < L - k + 1).
In our previous work (Kel et al., 1993
, 2001
), we described a statistical method that permits an identification of motifs
and the windows w that are characterized by a significant difference of their frequencies f(
,w,S) in the sequences S from the sets Y and N. The motifs found are used then for creating a context analyzer that is able to perform an additional filtering of the potential sites revealed by the weight matrix method.
The context analyzer is developed by means of linear discriminant analysis. The motifs selected at the previous step (
1,
2,...,
m) are used for construction of a linear classification function discriminating sets Y and N. So, for every sequence X, we calculate the score of context:
![]() | (1) |
where
i and
are the coefficients of the discriminating function.
A Genetic Algorithm to Determine Composite Regulatory Modules. We define a composite module CM as a set of TF weight matrices with given matrix cutoffs and other parameters that is associated with a specific functional type of gene regulatory regions. We have developed a new computational method to determine CMs in a set of promoter (or other regulatory) sequences of regulated genes. This method is based on a genetic algorithm (a prototype of this method is incorporated in the tool package ClusterScan (Kel-Margoulis et al., 2002b
). The CMs are characterized by the following parameters: K, the number of PWMs in the module (typically 6-12), cutoff values q(k)cutoff, relative importance values
(k) and maximum number of best matches
(k) that are assigned to every weight matrix k (k = 1,K) in the CM. These K matrices are selected by the program from a library of all considered matrices. We use different profiles, including the profile vertebrate_minFN62.prf, which includes 410 matrices for different transcription factors of vertebrate organisms (TRANSFAC release 6.4). Some matrices are organized in pairs. Such pairs are designed to model composite elements that play a very important role providing synergistic or antagonistic regulation of transcription through binding of different transcription factors. (Kel et al., 1995
). A parameter R is defined that puts a limit on a distance between matches of these matrix pairs (at least one pair of matches should be found fitting this limit). When all these parameter settings can be defined a "composite module score" (CM score) is given for any sequence X using the following equation:
![]() | (2) |
where qi(k) (X) are the
(k) best scoring sites found in the sequence X by the matrix (k). Then the genetic algorithm is used to determine the specific parameters of CMs for a particular set of promoters. The general description of the genetic algorithm was reported in Kel et al. (2001
).
We define the goal function G as a weighted sum of false negative and false positive errors and determine the statistical significance (t test) over several random iterations of bootstrap procedure by splitting the initial sequence sets into a training and a testing subset. In addition, we test the normal distribution of the F function over the set of positive and negative sequences. Such thorough way of calculating the goal function using bootstrap test and normality check allows to assess the usability of the obtained solutions for classification of individual sequences. Because of the small size of the sequence set, we decided to use the complete set for the training and apply the thorough analysis of distribution rather then to sacrifice a part of the set for independent testing. Nonetheless, the final biological and truly independent test of the method is done by the complete genome screening and by comparing results with microarray data (see Results).
The program CMFinder (cmf) takes as an input two sets of sequences (the set which is analyzed and a background set) and a set of weight matrices for transcription factors. For defined parameters K and R and over a number of iterations, the program optimizes the set of matrices selected, their number, their cutoff values, the relative importance, and the maximum number of best matches. The user can vary parameters K and R and compare results of the program. The output of the program is a profile ready to run by MATCH or TRANSPLORER.
Method to Determine CMs Correlated with Gene Expression. The method is based on a modification of the genetic algorithm described in the previous section. The program CMFinder now searches for a combination of TF matrices that minimizes a goal function G that is a weighted sum of the least squares deviation between logarithms of gene expression values and the score function FCM described in the previous section and a rank correlation coefficient:
![]() | (3) |
here, LR = log2(ex1/ex2), which is the log-ratio of the expression values of a particular gene in the experiment versus control (i.e., a signal log-ratio).
| Results |
|---|
|
|
|---|
|
The core of consensus sequence of the new matrix is "TNGCGTG", which fits exactly to the seven-nucleotide length consensus reported previously (Watson and Hankinson, 1992
; Bacsi et al., 1995
). So, with our analysis that was based on much broader collection of known AhR sites, we confirmed the well known consensus. In addition, we constructed the weight matrix that provides information on the nucleotide preferences in a broader, 11-nucleotide, window around this core consensus sequence.
Modulation of Gene Expression after AhR Activation. Microarray and RT-PCR experiments enabled an assessment of gene expression, and the results are given in the Table 1. We selected 114 genes (64 human and 50 rat) that changed their expression upon AhR activation. Among the 114 "changed" genes, we reconfirmed some of the well known targets of AhR with their known binding sites, such as CYP1A1, CYP1B1, Ugt1a1, and c-fos among others. We also confirmed some of the recently reported AhR targets, including Bcl-2, Bcl-xl, BAD, and cyclooxygenase 1, as shown in Table 1.
Analysis of Short Sequence Motifs in the Flanking Regions of AhR Sites (Local Context). To analyze the flanking regions of AhR sites, we computed the Local context as described under Materials and Methods. The analyzed sequences contained a proven AhR site in the center (11 bp in length, aligned according to the best fit to the matrix) and 50 bp flanks left and right from the site (the DNA strand orientation with regard to the TSS site was not considered). We identified 13 characteristics in these sequences (Table 2). Nine of them are over-represented in comparison with the negative control set of sequences (positive characteristics) and four underrepresented (negative characteristics). The negative control set was prepared as described under Materials and Methods by selecting from the INTERG sequences those subsequences of 111-nucleotide length in which the 11 central positions match with the AhR weight matrix (score cutoff q = 0.966).
|
As shown in Table 2, the selected motifs are three to four nucleotides in length and contain many ambiguous letters (from the IUPAC 15 letter code). The corresponding windows in which the selected motifs were found to be over-represented/underrepresented are rather different in their length and location (Table 2, second and third column). Some of the windows are located on the left flank, some on the right, and some of them overlap in the central positions.
On the basis of these 13 characteristics (m = 13), we constructed the linear classification function according to eq. 1 discriminating sets of real AhR sites from the false positives in INTERG sequences. We refer to this function as AhR score of local context d.
The advantage of the score of local context d as an additional filter to identify potential AhR sites is demonstrated in Fig. 2. As a first step, we identified the potential AhR sites by using the weight matrix with a definite cutoff value qcutoff. Then, we applied the d score to an extended region surrounding each potential AhR site. The requirement d > 0 (dcutoff = 0.0) is used as an additional criterion for recognition of AhR sites. In Fig. 2, the distributions of d values for the set of real AhR sites (red) and for the set of "false" AhR-like motifs in the intergenic regions (blue) are shown. These two distributions are clearly distinct from each other and by setting the cutoff value to d = 0.0, we can filter out almost 80% of the false sites while losing only 14% of the true positive sites. Thus, this method allows us to increase the correctness of our predictions of AhR sites in genomic sequences.
|
Searching for AhR Sites in Genomic Sequences Using Score of Context. We applied our high-precision method for recognition of AhR binding sites to the genomic sequences of genes that were shown to be regulated (up or down) after activation of AhR. It is known that functional AhR sites may be situated in the promoter regions of the AhR target genes, in their introns as well as in far upstream and downstream noncoding regions around the genes. Therefore, we analyzed the G20gene20kb set (see Materials and Methods) that includes 20-kb regions upstream and downstream of the genes. This set contains 40 sequences of the total length 1,855,876 bp.
We searched for AhR sites on either strand using the consecutive application of AhR site weight matrix and the score of context. One hundred ninety-five sites with a frequency of 0.000105072 sites per bp (or approximately 1 site per 10,000 bp) were identified with a matrix score cutoff s = 0.98 and the context score cutoff d = 0.0. Sequences of chromosome 21 with a total length of 34,011,570 bp were used for data curation. The same cutoff values were applied (s = 0.98 and d = 0.0). The total number of identified AhR sites was 1880, which corresponds to 0.00005528 sites per bp (or approximately 1 site per 18,000 bp). Therefore, the frequency of AhR sites in the AhR-regulated promoters is almost 2 times higher compared with overall sequences of the entire chromosome 21.
To check whether the revealed sites have any preferences in their location relative to the start of transcription (TSS position), we calculated the frequencies of predicted AhR sites in different regions around TSS of AhR-regulated genes. It is striking that, for some windows of limited length around the TSS, the frequency of AhR sites is much higher than in the other regions of regulated genes as well as in the whole chromosome 21 (Fig. 3). For instance, within the window [-1000 TSS + 1000] (-1000 upstream and +1000 downstream from the start of transcription), we observed AhR sites with a frequency of 0.0005125 sites per bp (or approximately 1 site per 2000 bp), which is almost 10 times higher than in the whole chromosome 21.
|
Although the frequency of AhR sites is very high near the start of transcription, it is not the only place where binding sites can be found. They can be found in the promoter regions of AhR-regulates genes (in 5' proximity to the start of transcription), inside the first noncoding exon, and in introns as well as in the far 5' and 3' regions of the genes. A very distinct feature of the location of AhR sites is a preponderance to form clusters. These clusters may contain from k = 4 to 5 up to 10 to 15 sites and the extent of the clusters varies from l = 50 bp up to 1000 bp (see examples in the Fig. 4).
|
We computed the probability of a given (k,l) cluster to occur just by chance in a sequence of length L given the total number of sites K in this sequence. The probability is calculated according to the algorithm described previously (Kel et al., 2001
). We found statistically significant AhR-site clusters in about 40% of AhR-regulated genes. Among them are the following genes: ALDH3A1(human), six sites inside 66 bp, from +1540 to +1606; BAD(human), six sites inside 338 bp, from -7519 to -7181; and NR3C1 (human), six sites inside 765 bp, from -893 to -128) and many others.
Composite Modules Revealed in Promoters of AhR-Regulated Genes. To reveal novel composite modules, the local nucleotide sequence of promoter regions of AhR-regulated genes was analyzed. Several subregions around TSSs were studied [e.g., (-2000/+2000), (-1000/+100), and (-500/+100)], and we applied different parameters for searches with the CompositeModuleFinder. We compared the structure of AhR-regulated promoters to a set of promoters randomly taken from the human genome. An extended table with the results of this analysis is available on request. Among the most discriminative matrices were those recognizing HNF1, AhR, GR, OCT, and C/EBP (Table 3). It is interesting that the AhR matrices selected by the algorithm were in the set of promoters (-1000/+100) only. The algorithm did not select shorter and longer promoter sequences. This indicates that not only the presence of a binding site is important but also its relative position within the promoter.
|
Among the most prominent matrix pairs that were selected by the algorithm are HNF1/Sp1 and AP-1/NF-1 for the maximum distance r = 100 bp; E2F/NF-1, AhR/Myb, HNF3/NFY, HNF6/NF-
B, and Sp1/Myb for the maximum distance r = 50 bp; and HNF-1/GR for the maximum distance r = 40 bp. It is noteworthy that not all matrices found individually were also found in pairs. Thus, a composite module for promoters could be defined.
In Table 4, a combination of individual matrices and matrix pairs for a set of promoters of AhR regulated genes (-2000/+2000) are listed. One matrix (YY1) and two pairs (E2F/ROR and AhR/CREB) were selected by the algorithm, because their frequencies in the AhR-responsive promoters were lower than in others. All other matrices and pairs of matrices were selected to define an AhR-associated composite regulatory module CMAhR in the promoters investigated.
|
As given in Fig. 5, the distribution of the composite module score FCM is distinctly different (t test value is 12.12, p value = 1.7 x 10-28) and demonstrates robust recognition of AhR-responsive promoters. A summary of sites and site pairs from CMAhR found in AhR-regulated genes is given in Table 5.
|
|
The composite structure of an AhR-regulated promoter, e.g., the human ALDH3A1 gene, is given in Fig. 6. Identified sites, site pairs, and the AhR site cluster are all distributed over the whole region of 4 kb around the TSS. Comparison of this region between different species reveals that only the proximal promoter is conserved. There were no detectable homologies between human and mouse/rat in the other parts of this 4-kb region. It seems that the sites predicted in those parts are specific for the human gene only. In previous experimental studies, several AhR sites as well as NF-1 and GR sites were detected in different 5' upstream regions of rat and mouse ALDH3A1 gene (Vasiliou et al., 1999
). Here, we confirmed that the proximal promoter of this gene has a typical AhR-regulation-associated structure.
|
Prediction of Gene Expression Based on the Composite Modules. Next, we addressed the differences in the structure of the regulatory regions between up- and down-regulated genes. The dynamics of regulation is actually quite different. Therefore, we aimed to identify transcription factors that influence the dynamics of up- or down-regulation in response to AhR protein-DNA interaction.
We thus compared two sets of promoters: for the up-regulated genes ExpressUP and for down-regulated genes ExpressDOWN (taking the regions of promoters from -2000 to +2000). The relative expression level was considered as additional information for this analysis.
With the help of the Composite Module Finder Express, we determined a set of matrices that correlates with gene expression in the two sets of regulated genes (Table 6). Patterns with positive "impact" values
are over-represented in promoters of up-regulated genes (shown in bold). Patterns with negative impact values are over-represented in promoters of down-regulated genes (underlined).
|
A linear function is computed (see Materials and Methods, eq. 2) that predicts the level of up- or down-regulation of a gene based on the sequence of its promoter. This function takes into account impact values for selected patterns given in Table 6. We obtained a significant correlation when the logarithm of the experimentally determined gene expression levels were plotted against predicted ones (Fig. 7).
|
Identification of Genomic Target for the AhR. The study of promoters of AhR-regulated genes and the hierarchy of AhR sites enabled us to develop a definition of a novel genetic algorithm. The rules of this algorithm are given below and were applied to search for new AhR targets within the human genome.
|
| Discussion |
|---|
|
|
|---|
The Concept of the CMs. Functionally related genes involved in the same molecular-genetic, biochemical, or physiological process are often regulated coordinately by specific combinations of transcription factors. Dynamic function-specific complexes of many different transcription factors, so called enhanceosomes (Merika and Thanos, 2001
), are formed at gene promoters and enhancers driving gene expression in specific manners. At the level of DNA, the blueprints for assembling of such variable TF complexes on promoter regions may be seen as specific combinations of TF binding sites located in proximity to each other. We call such structures "composite regulatory modules". The various types of composite modules can be classified according to their ability to regulate gene expression. There may be several different types of CMs located in the regulatory region of one gene, that can be spaced (e.g., liver specific and muscle-specific enhancers of one gene) or overlapping. We also consider the hierarchy of the CMs based on the number of different TFs involved and the length of the sequence covered. Composite elements (CEs) consisting of two/three closely located sites belong to the lowest hierarchical level of CMs. CMs of a higher level may include more sites as well as CMs of a lower level such as CEs.
A number of approaches identifying composite motifs have been described: BioProspector (Liu et al., 2001
), CoBind (GuhaThakurta and Stormo, 2001
), MITRA (Eskin and Pevzner, 2002
), and dyad search (van Helden et al., 2000
). These programs help to discover new regulatory sites for yet unknown transcription factors, but an "ab initio" motif finding method is limited by the length of sequences and may not be suitable for the analysis of long regulatory regions of genes of human and other higher eukaryotic organisms. A valuable source to identify transcription factor binding sites is the TRANSFAC database (http://www.biobase.de) (Wingender et al., 2001
).
We developed a new method for detecting de novo composite modules using information from TRANSFAC and the weight matrices for many different transcription factor families. It is a rather complex computational task to find combinations of weight matrices that are characteristic for a particular group of genes. Many variables are included in the model, such as the number of matrices in a module, the maximal number of matches considered, the distances between matches, and cutoff values for the matrices and their relative impact on the composite score. To find an optimal structure of the model we applied a genetic algorithm approach that was shown to be efficient in finding solutions. One of the reasons to use a genetic algorithm was the need to perform simultaneous fitting of the modeled parameters and to reduce the number of parameters. Because of the large number of possible matrix combinations it is impractical to use canonical multidimensional regression (linear or logit) without prior reduction of the number of parameters. Of course, when selecting a heuristic algorithm for optimization one must be aware that it finds only local optima and solutions can vary in different runs of the program. The selection of a proper goal function is very important for this approach. We constructed a robust goal function using some ideas of "utility calculations" (Kel et al., 1993
). The requirement for normality and bootstrap procedures (see Materials and Methods) provides the correct environment to find composite modules that describe the whole set of promoters and not just a particular subset.
The success of applying such tools greatly depends on how well the general model of the promoter we are optimizing actually fits to reality. Because of the limitations of data here, we consider only relatively simple models. To improve the model in future, we will consider such additional parameters as mutual distances between each site in the CM, their orientations to each other, and location relative to the start of transcription. All these and many other parameters could be easily incorporated in the model and optimized by the help of the genetic algorithm.
The algorithm that we applied identified several factors, including, SP1, NF-
B, and C/EBP. It is interesting that the AhR matrix alone was insufficient for the description of AhR-regulated promoters. Thus, specific combinations with other factors provided the best discrimination for detecting AhR-regulated genes.
We also performed computational tests with the cmf program on virtual data. We generated a set of random sequences by shuffling randomly nucleotides in every sequence of the G4000 set. Then, we implanted several binding sites of certain transcription factors by placing them at random. We used the cmf program to compare the sequence set with the implanted sites and those selected at random. We varied the minimal scores of the sites and their frequencies to find the borders. The result of this simulation (Table 8) shows that the cmf program was able to determine correctly all TF matrices used for site implantation, although the more matrices were used for implantation the more difficult it became for the program to reveal the sites correctly. It is noteworthy that matrices were found to behave differently in the test. For instance, the matrices for C/EBP and OCT were easily identified whereas recognition of the matrix for AhR factors was more difficult, and the AP-1 matrix was most difficult to identify. Similarity between different matrices in the library is an obvious confounding factor. Unfortunately, the algorithm can not identify different members belonging to the same TF family. For example, in the case of C/EBP, identification of a particular member (e.g., C/EBP-
, -
, or -
) was impossible. Nevertheless, specific combinations of transcription factors are indicative of a particular regulatory mechanism.
|
Another test performed was the "knocking-out" of those matrices most important for AhR promoter activation (i.e., distortion of six matrices: AhR, C/EBP, OCT, AP-1, NF1, and HNF1) by exchanging values for nucleotides A versus T and C versus G. After such distortion (without changing the C/G content), the program was run again with complete library where these six matrices were "knocked-out". No significant differentiation could be identified between AhR-regulated promoters and other promoters. This proved the specificity of our algorithm and indicated that the matrices we identified were indeed important for recognition of the AhR-regulated genes.
Our composite module contained matrices for AhR, PPAR, HNF-6, STAT, ROR, and ETS. It is interesting that three different AhR matrices were included by the algorithm in the CM. This suggests that sites for AhR in the promoters play an important role in the up- or down-regulation of these genes. V$AHR_01 got the maximal impact value. It seems to be specific for several cytochrome P450 genes and influences their expression in response to AhR. V$AHRARNT_01 was constructed on the basis of data from SELEX experiments, whereas V$AHR_Q5 was constructed in the current study on the basis of genomic sites. These two matrices have very different impact values: the first matrix has a positive and the second matrix a negative impact value. Comparison of the structure of these two matrices shows that they are very similar in the core but differ in some nucleotides at the flanks. For example, in the position 10 in the matrix V$AHR_Q5 (Fig. 1), the most prominent nucleotide is G, whereas in the matrix V$AHRARNT_01 (see TRANSFAC) nucleotide G in the corresponding position is absolutely "for-bidden". This could influence binding of some other factors such as repressors in the vicinity of AhR sites.
The combination of matrices and matrix pairs included in the CMs revealed a network of transcription factors governing the AhR-dependent gene regulation. Figure 8 depicts this network. Some of the genes of the transcription factors included in the CM, such as C/EBP-
, HNF1-
, HNF3, HNF4-
, OCT-1, are regulated by AhR as shown previously (Borlak et al., 2002
, 2003
). Analysis of their promoters using the CM points to an intriguing network of transcription factors with several feedback loops and a hierarchy of regulatory signals. This network of transcription factors can also explain the regulation of several genes that are not direct targets of AhR binding. Their regulation can be mediated through other TFs whose expression is regulated directly by AhR (Fig. 8).
|
On the genome scale, to search for potential new AhR target genes, we applied very stringent set of criteria to avoid high rate of false positives. In addition to the high score of the CM match, we require AhR sites with the proper oligonucleotide context to be found in the promoters as well as presence of clusters of AhR sites. With such stringent criteria, we definitely miss the indirect targetsgenes whose AhR-dependent regulation is mediated through other TFs. As a result, a high proportion of true positive prediction is to be expected as it is confirmed by comparison with the results of microarray studies.
In conclusion, a computational method was developed for an ab initio identification of AhR-regulated genes. The applied algorithm revealed a specific motif within a distinct promoter region near the AhR sites. This finding enabled us to identify clusters of regulatory proteins. It is noteworthy that the genetic algorithm predicts the level of transcript expression and may be used for the identification of genes regulated by AhR.
| Acknowledgements |
|---|
| Footnotes |
|---|
ABBREVIATIONS: TF, transcription factor; AhR, aryl hydrocarbon receptor; Arnt, aryl hydrocarbon receptor nuclear translocator; Ah, aryl hydrocarbon; RT-PCR, reverse transcription-polymerase chain reaction; PCR, polymerase chain reaction; TSS, transcription start site; kb, kilobase(s); bp, base pair(s); PWM, positional weight matrice; CM, composite regulatory module; CE, composite element; NF-
B, nuclear factor-
B.
1 We are using the following one-letter code for different combinations of alternative nucleotides: W, (A/T, read A or T); R, (A/G); M, (A/C); K, (T/G); Y, (T/C); S, (G/C); B, (T/G/C);V, (A/G/C); H, (A/T/C); D, (A/T/G); and N, (A/T/G/C). ![]()
Address correspondence to: Prof. Dr. Jürgen Borlak, Fraunhofer Institute of Toxicology and Experimental Medicine, Center for Drug Research and Medical Biotechnology, Nikolai-Fuchs-Str. 1, D-30625 Hannover, Germany. E-mail: borlak{at}item.fraunhofer.de
| References |
|---|
|
|
|---|
Bacsi SG, Reisz-Porszasz S, and Hankinson O (1995) Orientation of the heterodimeric aryl hydrocarbon (dioxin) receptor complex on its asymmetric DNA recognition sequence. Mol Pharmacol 47: 432-438.[Abstract]
Borlak J, Dangers M, and Thum T (2002) Aroclor 1254 modulates gene expression of nuclear transcription factors: implications for albumin gene transcription and protein synthesis in rat hepatocyte cultures. Toxicol Appl Pharmacol 181: 79-88.[CrossRef][Medline]
Borlak J, Hock A, Hansen T, and Richter E (2003) DNA adducts in cultures of polychlorinated biphenyl-treated human hepatocytes. Toxicol Appl Pharmacol 188: 81-91.[CrossRef][Medline]
Brazma A, Vilo J, and Ukkonen E (1997) Finding transcription factor binding site combinations in yeast genome, in Computer Science and Biology (Frishman D and Mewes HW eds) pp 57-59, Proceedings of the German Conference on Bioinformatics GCB-97, Martinsried, Germany.
Dunn JC, Tompkins RG, and Yarmush ML (1991) Long-term in vitro function of adult hepatocytes in a collagen sandwich configuration. Biotechnol Prog 7: 237-245.[CrossRef][Medline]
Eskin E and Pevzner PA (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics 18 (Suppl 1): S354-S363.[Abstract]
Fessele S, Boehlk S, Mojaat A, Miyamoto NG, Werner T, Nelson EL, Schlondorff D, and Nelson PJ (2001) Molecular and in silico characterization of a promoter module and C/EBP element that mediate LPS-induced RANTES/CCL5 expression in monocytic cells. FASEB J 15: 577-579.
Frech K, Quandt K, and Werner T (1998) Muscle actin genes: a first step towards computational classification of tissue specific promoters. In Silico Biol 1: 29-38.[Medline]
Guha Thakurta D and Stormo GD (2001) Identifying target sites for cooperatively binding factors. Bioinformatics 17: 608-621.
Kel A, Kel-Margoulis O, Babenko V, and Wingender E (1999) Recognition of NFATp/AP-1 composite elements within genes induced upon the activation of immune cells. J Mol Biol 288: 353-376.[CrossRef][Medline]
Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, and Wingender E (2003) MATCH: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 31: 3576-3579.
Kel AE, Kel-Margoulis OV, Farnham PJ, Bartley SM, Wingender E, and Zhang MQ (2001) Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors. J Mol Biol 309: 99-120.[CrossRef][Medline]
Kel AE, Ponomarenko MP, Likhachev EA, Orlov YL, Ischenko IV, Milanesi L, and Kolchanov NA (1993) SITEVIDEO: a computer system for functional site analysis and recognition. Investigation of the human splice sites. Comput Appl Biosci 9: 617-627.
Kel OV, Romaschenko AG, Kel AE, Wingender E, and Kolchanov NA (1995) A compilation of composite regulatory elements affecting gene transcription in vertebrates. Nucleic Acids Res 23: 4097-4103.
Kel-Margoulis O, Kel AE, Reuter I, Deineko IV, and Wingender E (2002a) TRANSCompel: a database on composite regulatory elements in eukaryotic genes. Nucleic Acids Res 30: 332-334.
Kel-Margoulis OV, Ivanova TG, Wingender E, and Kel AE (2002b) Automatic annotation of genomic regulatory sequences by searching for composite clusters. Pac Symp Biocomput 187-198.
Liu X, Brutlag DL, and Liu JS (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput 127-138.
Merika M and Thanos D (2001) Enhanceosomes. Curr Opin Genet Dev 11: 205-208.[CrossRef][Medline]
Quandt K, Frech K, Karas H, Wingender E, and Werner T (1995) MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res 23: 4878-4884.
Safe S and Wormke M (2003) Inhibitory aryl hydrocarbon receptor-estrogen receptor alpha cross-talk and mechanisms of action. Chem Res Toxicol 16: 807-816.[Medline]
Singh-Gasson S, Green RD, Yue Y, Nelson C, Blattner F, Sussman MR, and Cerrina F (1999) Maskless fabrication of light-directed oligonucleotide microarrays using a digital micromirror array. Nat Biotechnol 17: 974-978.[CrossRef][Medline]
Tronche F, Ringeisen F, Blumenfeld M, Yaniv M, and Pontoglio M (1997) Analysis of the distribution of binding sites for a tissue-specific transcription factor in the vertebrate genome. J Mol Biol 266: 231-245.[CrossRef][Medline]
van Helden J, Rios AF, and Collado-Vides J (2000) Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res 28: 1808-1818.
Vasiliou V, Reuter SF, Williams S, Puga A, and Nebert DW (1999) Mouse cytosolic class 3 aldehyde dehydrogenase (Aldh3a1): gene structure and regulation of constitutive and dioxin-inducible expression. Pharmacogen 9: 569-580.
Wasserman WW and Fickett JW (1998) Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 278: 167-181.[CrossRef][Medline]
Watson AJ and Hankinson O (1992) Dioxin- and Ah receptor-dependent protein binding to xenobiotic responsive elements and G-rich DNA studied by in vivo footprinting. J Biol Chem 267: 6874-6878.
Whitlock JP (1999) Induction of cytochrome P4501A1. Annu Rev Pharmacol Toxicol 39: 103-125.[CrossRef][Medline]
Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhäuser R, et al. (2001) The TRANSFAC system on gene expression regulation. Nucleic Acids Res 29: 281-283.
This article has been cited by other articles:
![]() |
E. Wingender The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation Brief Bioinform, July 1, 2008; 9(4): 326 - 332. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Wu, L. Zhang, M. S. Hoagland, and H. I. Swanson Lack of the Aryl Hydrocarbon Receptor Leads to Impaired Activation of AKT/Protein Kinase B and Enhanced Sensitivity to Apoptosis Induced via the Intrinsic Pathway J. Pharmacol. Exp. Ther., January 1, 2007; 320(1): 448 - 457. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Waleev, D. Shtokalo, T. Konovalova, N. Voss, E. Cheremushkin, P. Stegmaier, O. Kel-Margoulis, E. Wingender, and A. Kel Composite Module Analyst: identification of transcription factor binding site combinations using genetic algorithm. Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W541 - W545. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Kel, T. Konovalova, T. Waleev, E. Cheremushkin, O. Kel-Margoulis, and E. Wingender Composite Module Analyst: a fitness-based tool for identification of transcription factor binding site combinations Bioinformatics, May 15, 2006; 22(10): 1190 - 1197. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. C. Volz, D. E. Hinton, J. M. Law, and S. W. Kullman Dynamic Gene Expression Changes Precede Dioxin-Induced Liver Pathogenesis in Medaka Fish Toxicol. Sci., February 1, 2006; 89(2): 524 - 534. [Abstract] [Full Text] [PDF] |
||||