Molecular replacement with Phaser
Phaser is a program for phasing macromolecular crystal structures with maximum likelihood methods. It has been developed by Randy Read's group at the University of Cambridge and is available through the Phenix and CCP4 software suites. General information is available on the Phaser website. In particular, questions that are not answered by this document may be answered by the Phaser FAQ section of that website. In addition, tutorials for carrying out molecular replacement calculations (including links to the necessary data) can be found on the Phaser tutorials page of that website.
This section describes the use of Phaser for molecular replacement. The details refer to Phaser version 1.3 (part of the CCP4 6.0 release), but there are only minor changes in version 2.1 (to be released with CCP4 6.1). Phaser can also be used for experimental phasing by single-wavelength anomalous diffraction (SAD) and by a combination of SAD and molecular replacement. Experimental phasing with Phaser is described on a separate page.
All molecular replacement programs have certain features and concepts in common: they use a 3D atomic model obtained from a known structure (often called a "template") to solve the crystal structure of an unknown structure (often called the "target"), by seeing how well the measured diffraction data agree with data computed from the model. However, the different programs differ in some of the details and underlying concepts. The concepts particular to Phaser are discussed here.
Likelihood measures the agreement of the model with the data by using probabilities. The likelihood is defined as the probability that the data would have been measured, given the information contained in the model. For molecular replacement, the "model" being tested includes not only the structure of the template, but also the orientation and/or position of that template in the unit cell of the target, as well as parameters describing the sizes of different sources of error. In other applications (such as structure refinement), the effect of model errors on the ability to predict the diffraction data can be determined by comparing the observed and calculated structure factor amplitudes. In molecular replacement, the structure factor comparison cannot be done until after the molecular replacement problem is solved, so the impact of errors has to be estimated before the molecular replacement calculation.
The most important sources of error in predicting the diffraction data come from differences in the atomic coordinates of the template and the target, and from incompleteness in the template, which may be missing whole domains or even whole molecules from the target. Before determining the structure, it is usually straightforward to estimate the completeness of the template (which is why Phaser has to be given information about the content of the asymmetric unit of the target crystal). However, to estimate the effect of coordinate error, Phaser needs an estimate of the size of coordinate errors. This is usually obtained by exploiting a relationship between sequence identity and RMS error, determined by Chothia and Lesk from a comparison of the structures of related proteins.
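As a rough illustration of the identity-to-error conversion, the Chothia and Lesk relationship can be written as rms = 0.40 * exp(1.87 * H), where H is the fraction of non-identical residues. The sketch below applies that formula. The function name and the lower bound of 0.8 Å are assumptions made for illustration (a floor of this kind is commonly quoted for Phaser, but check the documentation for your version).

```python
import math


def rms_from_identity(identity, floor=0.8):
    """Estimate the RMS coordinate error (in Angstrom) from sequence identity.

    identity: fractional sequence identity between template and target (0 to 1).
    floor:    assumed lower bound on the estimate (hypothetical default).
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be a fraction between 0 and 1")
    # Chothia & Lesk: rms = 0.40 * exp(1.87 * H), H = fraction of changed residues
    estimate = 0.40 * math.exp(1.87 * (1.0 - identity))
    return max(floor, estimate)


if __name__ == "__main__":
    for ident in (1.0, 0.5, 0.3):
        print(f"identity {ident:.0%}: estimated RMS error {rms_from_identity(ident):.2f} A")
```

The estimate grows quickly as identity drops, which is why models with low sequence identity are much harder to use as templates.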
In molecular replacement, components of the structure are typically treated as rigid bodies that are rotated and translated to place them correctly in the unit cell of the target structure. These components are often whole proteins or even whole complexes, but for flexible proteins they may be smaller pieces such as domains or subdomains.
In Phaser, each type of component that will be treated as a rigid body is referred to as an ensemble. The term "ensemble" is used because it is possible to describe a component as an ensemble of alternative models from homologous structures, even though there will often be just one structure in an ensemble. The members of the ensemble must be superimposed on each other into a common orientation. For each ensemble, Phaser computes a set of statistically-weighted average structure factors (weighted according to expected error, and taking account of correlations between pairs of models), which is used in the molecular replacement calculations. Models should only be grouped into an ensemble if they superimpose reasonably well on each other; otherwise the average density will be too diffuse to be of much use for molecular replacement. If they superimpose poorly, it is better to search with the individual models as separate alternatives. If there are domain movements, it may be better to search separately for the individual domains.
Note that only one ensemble should be defined for each type of component; it is straightforward to search for several copies of an ensemble, and it only wastes computer time and memory to define the ensemble more than once.
 Running Phaser for automated molecular replacement
Most molecular replacement problems that can be solved with Phaser can be solved with the automated search mode. By default, the ccp4i interface to Phaser comes up in this mode, though other modes (discussed below) can be chosen from the Mode pulldown. In an automated search, Phaser runs several individual modes in sequence: anisotropy correction (to remove overall anisotropy from the diffraction data), cell content analysis, rotation search (likelihood-based fast rotation function followed by rescoring of the top peaks using the rotation likelihood function), translation search (likelihood-based fast translation function followed by rescoring of the top peaks using the translation likelihood function), packing check (testing for overlaps and choosing symmetry-related copies of molecules to create a tightly packed assembly), and rigid-body refinement (which also prunes duplicate solutions from the list). The automated search can also search for more than one copy of a molecule or more than one different type of molecule, and it can test different possible choices of model or space group.
 Define data folder
The "Define data" folder of the interface is used to select an MTZ file containing the diffraction data for the target (unknown) structure. After choosing an MTZ file, you may need to change the default choices of the columns for the amplitude (F) and its standard deviation (SIGF). The default choice of resolution limits (limited to 2.5 Å for crystals that diffract to higher resolution) usually works well. In some special circumstances, you could use data to higher resolution (e.g. when the search model is small) or to lower resolution (e.g. when the search model is very large, or if you suspect that there is a large difference in overall B-factor between a molecule that has already been placed and the one currently being searched for). If there is some uncertainty about the space group, you can choose to carry out the first translation search over several, or even all, space groups with the same Laue symmetry. Finally, if the space group is one of an enantiomorphic pair (e.g. P3₁21/P3₂21), you will want to test both enantiomorphs.
 Define ensembles folder
The "Define ensembles" folder allows you to define the template models used for molecular replacement. You should define an ensemble for each type of rigid body component that you will be searching for, but remember that you can search for multiple copies of the same ensemble without defining the ensemble more than once. For example, if you were solving the structure of an AB5 toxin, you could define two ensembles (one for the A-subunit and one for the B-subunit), then search for one copy of A and five copies of B. If you wish to test alternative possible models for a component, each alternative should be defined as a separate ensemble. Ensembles are usually specified as atomic models in PDB files, but it is also possible to provide an MTZ file containing structure factors corresponding to the density for the search model. This allows you, for instance, to use experimental density from one crystal form to solve the structure of another crystal form, or to solve a crystal structure at low resolution using electron density from an electron microscopic image reconstruction. Note that, if you specify an ensemble by using an MTZ file, Phaser needs to be told the things that it would normally deduce from the PDB file, i.e. the extent (size of a rectangular box containing the electron density), the centre of the ensemble and the content of the ensemble (corresponding molecular weight of protein and nucleic acid). Finally, to compute the likelihood function, Phaser needs an estimate of how accurate the model is (so that it knows how precisely the structure factors can be predicted). For an atomic model, this can be specified either as an RMS error or as the sequence identity of the template to the target (from which the RMS error can be estimated using an equation derived by Chothia & Lesk). For a density (MTZ) model, a guess of the RMS error that would give equivalent errors in the electron density should be supplied.
 Search details folder
The "Search details" folder specifies which ensembles should be searched for, how many copies of each, and in what order. By default, Phaser expects only one choice for each ensemble, but this can be changed by turning on the "allow search with alternative ensembles" option, in which case a number of choices can be given for each component in the search. In this folder, you can also change the defaults for how Phaser chooses which orientations to keep for subsequent translation searches, which translations should be kept, and the criteria for unacceptable crystal packing. The defaults usually work well, but in difficult molecular replacement problems it may be useful to select more orientations (e.g. by setting the rotation search to keep orientations above 65% of the maximum, instead of the default of 75%).
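For readers who prefer to run Phaser from a keyword script rather than through ccp4i, the same choices (data, ensembles, composition, search order) can be written as keyword lines. The sketch below only assembles such a script as text for the AB5 toxin example. The file names, column labels and helper function are hypothetical, and the keyword spellings follow the style used in the Phaser keyword documentation but should be checked against the version you are running.

```python
def mr_auto_script(hklin, f_label, sigf_label, ensembles, root):
    """Assemble a keyword script for an automated search (illustrative only).

    ensembles: list of dicts with keys 'name', 'pdb', 'identity', 'seq', 'copies',
               given in the order in which they should be searched for.
    """
    lines = [
        "MODE MR_AUTO",
        f"HKLIN {hklin}",
        f"LABIN F={f_label} SIGF={sigf_label}",
    ]
    # One ENSEMBLE definition per type of rigid body, never per copy
    for e in ensembles:
        lines.append(f"ENSEMBLE {e['name']} PDB {e['pdb']} IDENTITY {e['identity']}")
    # Composition of the asymmetric unit, from sequence files
    for e in ensembles:
        lines.append(f"COMPOSITION PROTEIN SEQUENCE {e['seq']} NUM {e['copies']}")
    # Search order and number of copies of each ensemble
    for e in ensembles:
        lines.append(f"SEARCH ENSEMBLE {e['name']} NUM {e['copies']}")
    lines.append(f"ROOT {root}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical AB5 toxin: one A-subunit and five B-subunits
    toxin = [
        {"name": "A", "pdb": "a_subunit.pdb", "identity": 40, "seq": "a.seq", "copies": 1},
        {"name": "B", "pdb": "b_subunit.pdb", "identity": 55, "seq": "b.seq", "copies": 5},
    ]
    print(mr_auto_script("toxin.mtz", "F", "SIGF", toxin, "toxin_auto"))
```

Note that the B-subunit ensemble is defined once and searched for five times, as recommended above.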
 Composition folder
The "Composition" folder is needed because, in order to determine how well the structure factors can be predicted by the model, Phaser needs to know how complete the model of the crystal structure is. You can be lazy and tell it to guess that half of the asymmetric unit is occupied by protein, but it is better to provide the sequences of the proteins and/or nucleic acids making up the crystal, using files in FASTA format. (In fact, Phaser reads the sequence file and interprets any line that does not start with a ">" symbol as the sequence in one-letter code.)
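The sequence-reading rule described above (every line that does not start with ">" is sequence) is simple enough to sketch. The function below is an illustration of that rule only, not Phaser's own reader, and the example sequence is made up.

```python
def read_sequence(text):
    """Return the one-letter sequence from FASTA-style text.

    Lines beginning with '>' are treated as headers and skipped;
    every other non-blank line is taken as sequence.
    """
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        parts.append(line)
    return "".join(parts)


if __name__ == "__main__":
    example = ">hypothetical protein\nMKTAYIAKQR\nQISFVKSHFS\n"
    seq = read_sequence(example)
    print(len(seq), "residues:", seq)
```

One consequence of this rule is that a file containing several ">" entries is read as a single concatenated sequence, so each sequence file should describe exactly the component you intend.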
Hopefully a single run of Phaser will solve the structure even if there are several molecules in the asymmetric unit, but in difficult cases it may be preferable to search for one molecule at a time, examine the output, then submit another run looking for the next molecule. This is done by providing the .sol file from a previous run of Phaser in the "Define search sets" folder. If you wish, you can edit the .sol file to comment out (with "#" characters) any potential partial solutions that you do not wish to test in the search for subsequent molecules.
 Running Phaser in other modes
It is most convenient to use Phaser in the automated mode, but in difficult cases it can be helpful to run the individual steps separately. For this purpose, the other modes can be chosen with the Mode pull-down at the top of the interface.
 Separate rotation search
By default, this mode will give the same rotation search that would be carried out in the automated mode. The results will be a series of rotations recorded in a .rlist file, which can be provided as input to a translation search (discussed below). In special cases, this mode can be used to select the "brute rotation function" from the pull-down at the end of the Mode line. The main use for the brute rotation function is to search for orientations near a particular set of Euler angles, when you know the approximate orientation of the model. This is most likely to be the case when you are searching for a component of a flexible molecule, knowing the orientation of one component. For instance, if you were solving the structure of a flexible two-domain protein, you could search for the larger domain first. If a convincing solution were found, you would know that the smaller domain is likely to be in a similar relative orientation, so you could restrict the orientation to angles within (say) 30 degrees of the orientation for the larger domain. Most or all of these orientations could be kept for a separate translation search. To use the "search around" option, select "around an angle" from the Search pull-down in the Search details folder.
 Separate translation search
Again, this is very similar to the translation search carried out as part of the automated search. As noted above, orientations from a rotation search are read from a .rlist file, specified in the Search details folder.
 Separate refinement and phasing and packing modes
These modes are provided to make the functionality available outside the automated search mode, when separate rotation and translation searches are carried out.
 Anisotropy correction
In the other modes, Phaser carries out an anisotropy correction and uses the corrected data internally, but does not write the corrected F/SIGF values into the output MTZ file. In this mode, the corrected values are written to the output MTZ file (appending "_ISO" to the column names). This can be useful, e.g. to correct data for procedures such as Patterson map calculation that do not carry out an internal correction for anisotropy. However, you should not use the corrected data for refinement, as refinement programs such as Refmac5 or phenix.refine will do a better job of anisotropy correction by comparing the observed and calculated structure factors.
 Cell content analysis
It can be useful to run this mode first to get an idea of how many copies of the assembly are expected in the asymmetric unit. For the composition, you should specify the stoichiometry of one assembly; the cell content analysis will then report the relative probability of observing one or several copies of this assembly, using data compiled by Kantardjieff and Rupp.
 Normal mode analysis
In a case where you suspect that the molecule is flexible, the first option should be to search for individual rigid domains, at least if the domain boundaries are reasonably obvious. If this does not work, then it can be useful to perturb the structure using the normal modes, then to carry out a search testing the different perturbed structures as alternative models.
 Output files
 Log file
The most important details of the log file are marked up with summary tags. These parts appear in red in the ccp4i log file viewer, or can be viewed on their own by pressing the "Show Summary" button. The important results from the different modules of an automated search are summarized in the following.
 Anisotropy correction
Phaser uses a likelihood target to refine an anisotropy correction. Once the anisotropy correction has been applied, the intensities should fall off equally in all directions in the diffraction pattern. (A better correction can be done once there is an atomic model, which is why it is better to let refinement programs carry out their own correction on the uncorrected data.) At the end of the analysis, Phaser reports the size of the correction along three principal axes of a thermal ellipsoid and reports an "anisotropic deltaB", which is the difference between the biggest and smallest components of the thermal ellipsoid. You can get a feel for the size of the effect of the anisotropy correction at the resolution limit by noting that intensities in the direction where the diffraction is weakest will be scaled up by a factor equal to $\exp(2\Delta B/(4 d_{min}^2)) = \exp(\Delta B/(2 d_{min}^2))$, relative to the intensities from the strongest direction, where $\Delta B$ is the anisotropic deltaB and $d_{min}$ is the resolution limit. For example, if the anisotropic deltaB is 30 Å² and your crystal diffracts to 3 Å resolution, the intensities in the weakest direction are scaled up by a factor of more than 5 compared to the intensities in the strongest direction. This constitutes a significant level of anisotropy.
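The size of this effect is easy to check for your own data. The following sketch simply evaluates the factor exp(deltaB/(2 dmin^2)) given above; the function name is chosen here for illustration.

```python
import math


def anisotropy_intensity_ratio(delta_b, d_min):
    """Factor by which intensities in the weakest direction are scaled up,
    relative to the strongest direction, at the resolution limit.

    delta_b: anisotropic deltaB in square Angstrom
    d_min:   resolution limit in Angstrom
    """
    if d_min <= 0:
        raise ValueError("d_min must be positive")
    return math.exp(delta_b / (2.0 * d_min ** 2))


if __name__ == "__main__":
    # The example from the text: deltaB = 30 A^2, data to 3 A resolution
    print(f"{anisotropy_intensity_ratio(30.0, 3.0):.2f}")
```

Because the resolution enters as 1/dmin^2, the same deltaB has a much larger effect on high-resolution data than on low-resolution data.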
 Cell content analysis
Phaser compares the content you have specified for the asymmetric unit (typically by giving sequence files for the different components and saying how many copies of each component are present) with the average content determined from an analysis by Kantardjieff and Rupp. You should look at how your content compares to the frequency distribution of previously observed contents, to see whether you should consider other possibilities for the number of copies.
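To get a feel for the numbers before running this step, you can compute the Matthews coefficient (unit cell volume per Dalton of protein) and the corresponding solvent fraction by hand. The sketch below uses the standard approximation of 1 - 1.23/VM for the solvent content of a protein crystal. It does not reproduce the probability weighting that Phaser applies, and the cell, molecular weight and function names are hypothetical.

```python
import math


def cell_volume(a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """Unit cell volume in cubic Angstrom (lengths in Angstrom, angles in degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)


def matthews(volume, n_symops, mw_da, n_copies):
    """Return (VM, solvent fraction) for n_copies of an assembly per asymmetric unit.

    n_symops: number of symmetry operators of the space group
    mw_da:    molecular weight of one assembly in Dalton
    """
    vm = volume / (n_symops * n_copies * mw_da)
    solvent = 1.0 - 1.23 / vm  # standard approximation for protein crystals
    return vm, solvent


if __name__ == "__main__":
    # Hypothetical orthorhombic crystal (4 symmetry operators), 25 kDa protein
    vol = cell_volume(50.0, 60.0, 70.0)
    for n in (1, 2):
        vm, solvent = matthews(vol, 4, 25000.0, n)
        print(f"{n} copies: VM = {vm:.2f} A^3/Da, solvent = {solvent:.0%}")
```

In this made-up case, one copy gives a plausible solvent content while two copies would leave far too little solvent, which is the kind of comparison the cell content analysis automates.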
 Fast rotation function
After the fast rotation function is carried out, the top orientations are rescored with the rotation likelihood function, then a cluster analysis is carried out to find the unique peaks. By default, only orientations with log-likelihood values that are above 75% of the maximum are reported. Two scores are given for each orientation: the log-likelihood-gain (LLG) and the Z-score.
The LLG indicates how much better the data can be predicted from the oriented model than from a random-atom model. There are two things you can learn from the LLG. First, it should be positive, otherwise your oriented model is worse than a random-atom model! If it is negative, something is wrong: your model might be much worse than expected (e.g. there is an unmodelled hinge motion between domains, or the fold is less well preserved than one expects from the sequence identity), or it is less complete than expected (e.g. there is a second copy in the asymmetric unit). Second, the value of the LLG can be used to compare the quality of different models against the same data. If you are testing different choices of model, the best one should give the highest LLG. If you are adding new information to the model (e.g. translation information for an oriented model, or a second subunit), the LLG should increase at each step.
The Z-score is computed as the LLG minus the mean LLG for a random sample of orientations, divided by the RMS deviation of a random sample of LLG values from the mean. In other words, it tells you the number of standard deviations above the mean for a particular LLG score. Z-scores for correct orientations can be relatively low in difficult cases (e.g. less than 4), but an orientation with a Z-score above 5 is usually correct.
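The definition above translates directly into a few lines of code. This is a minimal sketch of the arithmetic only; how Phaser draws its random sample is not specified here, and the numbers in the example are made up.

```python
import math


def z_score(llg, random_llgs):
    """Number of standard deviations by which llg exceeds the mean of a random sample.

    random_llgs: LLG values for a random sample of orientations (or translations).
    """
    n = len(random_llgs)
    if n == 0:
        raise ValueError("need at least one random LLG value")
    mean = sum(random_llgs) / n
    # RMS deviation of the random sample from its own mean
    rms = math.sqrt(sum((x - mean) ** 2 for x in random_llgs) / n)
    if rms == 0.0:
        raise ValueError("random sample has no spread")
    return (llg - mean) / rms


if __name__ == "__main__":
    sample = [0.0, 2.0, 4.0, 6.0, 8.0]  # made-up random-orientation LLGs
    print(f"Z-score: {z_score(18.0, sample):.2f}")
```

Because the Z-score is relative to the spread of the random sample, it can be compared between searches in a way that raw LLG values cannot.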
In addition, some sense of the significance of the solution can be gained from the number of orientations accepted for a subsequent translation search. If there is only one orientation above 75% of the maximum, then there is a very good chance it is correct.
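The effect of the percentage cutoff on the number of orientations carried forward can be illustrated as follows. This sketch takes the description in the text literally (a cutoff at a percentage of the top score); the baseline parameter is an assumption added here so that the percentage can instead be measured from some other reference value, such as the mean, if that is how your version of Phaser defines it.

```python
def select_peaks(scores, percent=75.0, baseline=0.0):
    """Keep the scores at or above `percent` of the way from baseline to the top score."""
    if not scores:
        return []
    top = max(scores)
    cutoff = baseline + (percent / 100.0) * (top - baseline)
    return [s for s in scores if s >= cutoff]


if __name__ == "__main__":
    peaks = [100.0, 80.0, 74.0, 50.0]  # made-up rotation function scores
    print("default 75%:", select_peaks(peaks))
    print("relaxed 65%:", select_peaks(peaks, percent=65.0))
```

With the made-up scores above, lowering the cutoff from 75% to 65% admits one extra orientation to the translation search, at the cost of extra computing time.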
 Fast translation function
A translation search is carried out for each orientation found by the rotation search. As for the rotation search, the top solutions from the fast search are rescored with the translation likelihood function, then a cluster analysis is carried out to find the unique peaks. The LLG and Z-score values have the same meaning as for the rotation search. However, one usually expects a higher Z-score value for a correct translation than for a correct orientation in the rotation search. If you are searching for a single copy of a single molecule, the correct translation will typically have a Z-score greater than 8. If you are searching for the first of several copies, then the Z-score will be lower, but is still unlikely to be less than about 6. One exception is searches for the first molecule in monoclinic space groups (e.g. P2₁, C2), where the translation search is carried out over only a plane. It is not uncommon for the Z-score to be relatively low in such cases. However, you would still hope to see that there is only one unique translation peak above the default threshold (75% of the maximum found in the search).
There is an exception to the rule that translations with Z-scores above 8 are usually correct. In the (fairly common) case of crystals with translational non-crystallographic symmetry, any pair of molecules in a similar orientation and separated by the correct NCS translation vector will give a large Z-score, even if the solution is incorrect. You can tell if there is translational NCS by looking at a native Patterson map; if there is a non-origin peak greater than about 15-20% of the origin peak, then you probably have translational NCS.
 Packing check
Each accepted rotation/translation solution is tested to see if it can be packed without serious clashes. The main thing you want to see in this section is that the packing check has not eliminated a solution with a much higher LLG score than solutions that survive the packing check. If it has, then you should be very wary of the final solution. In some cases (particularly with several molecules in the asymmetric unit), the discarded solutions are indeed incorrect, but when additional molecules are placed, the packing check does not eliminate the solutions with the highest LLG scores.
 Rigid-body refinement
Each potential solution is subjected to rigid-body refinement, and then the solutions are pruned to remove any that are equivalent (after considering crystallographic symmetry and possible changes of origin).
 PDB files
By default, Phaser produces a PDB file for only the top solution found. The number of PDB (and MTZ) files produced can be changed by specifying the "Number of top solutions" in the "Define data" folder. When there is more than one PDB file in an ensemble, only a single representative for that ensemble will be placed in the output PDB file, i.e. the one with the lowest estimated RMS error.
 MTZ files
The MTZ file produced by the automated search (or by a separate translation, refinement or rescoring run) contains the data from the input MTZ file, as well as the following new columns:
- FC/PHIC: Amplitude and phase computed from molecular replacement solution (including probability-weighted averages for ensembles)
- FWT/PHWT: Amplitude and phase for SigmaA-weighted 2mFo-DFc map coefficients
- DELFWT/PHDELWT: Amplitude and phase for SigmaA-weighted mFo-DFc (difference) map coefficients
- FOM: Figure of merit for calculated phase, computed using error estimates deduced from assumed RMS coordinate errors
- HLA/HLB/HLC/HLD: Corresponding Hendrickson-Lattman coefficients
As noted above, the MTZ file produced by the anisotropy correction mode contains corrected F/SIGF values.
 .sol file
Potential solutions are described in a .sol file. Each potential solution starts with a SOLU SET line, and subsequent SOLU 6DIM lines describe the orientations and positions of molecules making up the solution.
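If you want to inspect or filter solutions outside Phaser, the .sol file is simple to read line by line. The sketch below groups SOLU 6DIM lines under the preceding SOLU SET line and skips lines commented out with "#". It assumes that each SOLU 6DIM line carries ENSE, EULER and FRAC tokens followed by the ensemble name, three Euler angles and three fractional coordinates; that layout, and the sample values used with it, are assumptions to be checked against a real file.

```python
def parse_sol(text):
    """Parse .sol-style text into a list of solutions.

    Each solution is a list of dicts, one per placed molecule, with keys
    'ensemble', 'euler' (three angles) and 'frac' (three fractional coordinates).
    """
    solutions = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented-out line
        fields = line.split()
        if fields[:2] == ["SOLU", "SET"]:
            solutions.append([])  # start a new potential solution
        elif fields[:2] == ["SOLU", "6DIM"] and solutions:
            # Assumed token layout: ENSE <name> EULER <a b g> FRAC <x y z>
            i_ense = fields.index("ENSE")
            i_euler = fields.index("EULER")
            i_frac = fields.index("FRAC")
            solutions[-1].append({
                "ensemble": fields[i_ense + 1],
                "euler": tuple(float(v) for v in fields[i_euler + 1:i_euler + 4]),
                "frac": tuple(float(v) for v in fields[i_frac + 1:i_frac + 4]),
            })
    return solutions


if __name__ == "__main__":
    sample = (
        "SOLU SET\n"
        "SOLU 6DIM ENSE beta EULER 201.0 41.3 184.0 FRAC -0.494 -0.158 -0.281\n"
        "#SOLU SET\n"
        "#SOLU 6DIM ENSE beta EULER 10.0 20.0 30.0 FRAC 0.1 0.2 0.3\n"
    )
    for number, solution in enumerate(parse_sol(sample), start=1):
        print(number, solution)
```

When commenting out a solution by hand, comment out its SOLU SET line together with all of its SOLU 6DIM lines; otherwise the remaining lines would be read as part of the previous solution.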
 .rlist file
Similar to a .sol file. If there are SOLU 6DIM lines, these describe the partial structure that was known prior to the rotation search. SOLU TRIAL lines describe the possible orientations for the ensemble tested in the rotation search.