Scaling experimental intensities with Scala


Scala merges multiple observations of reflections and produces a file that contains averaged intensities for each reflection.

The ccp4i task Scale & Merge Intensities also includes the next step, calculation of amplitude |F| from intensity I, using Truncate, and optional generation or import of a FreeR set. Scala requires an input MTZ file containing unmerged intensities, such as that produced by Mosflm.

  • You may wish to put the input file through the Find or Match Laue group task to determine or check the symmetry first. Files from integration programs other than Mosflm may be imported through the Import Integrated Data task: data from XDS or Scalepack may be imported by Pointless, some other formats by Combat; data from d*trek may be imported by dtrek2scala (no ccp4i task). The input file will be sorted if necessary (files from Pointless are already sorted).
  • If the input file contains multiple datasets, eg from a MAD experiment, they will be scaled together, but merged separately. At present, these are split into separate files, since Truncate can only handle one dataset at a time, then recombined into a single file after the Truncate step.



Running the Scale & Merge task

Scala has a large number of options, most of which are rarely (if ever) useful. Only the most common options are described here: see full documentation for complete details.

The main control section

[Figure: The Scala user interface main controls]

The top part of the task window contains the main control options and the specification of the input & output files. The most important option here is to specify whether the crystal contains anomalous scatterers. Anomalous differences are always preserved in the output file, but this switch controls whether Bijvoet-related observations (I+, I-) are treated as equivalent in the standard deviation corrections, the statistics (eg Rmerge) and the outlier tests. Even if anomalous scatterers are present, it is usual to treat Bijvoet observations together in the scaling. There is an option to create a FreeR column: it is important that this is taken from an existing file if you have already created a FreeR set for this crystal form. Resolution limits may also be reset here: it is often useful to integrate the images to slightly higher resolution than you think the pattern extends to, then cut back the resolution following a first run of Scala.
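As an illustration of what the anomalous switch changes, the sketch below (plain Python with a made-up observation layout, not Scala's internal code) shows the two ways Bijvoet-related observations can be pooled before averaging:

 from collections import defaultdict

 # Toy observation list: (h, k, l, Bijvoet sign, intensity).
 # This layout is invented for illustration only.
 obs = [
     (1, 2, 3, +1, 105.0), (1, 2, 3, -1, 98.0),
     (1, 2, 3, +1, 101.0), (2, 0, 0, +1, 55.0),
 ]

 def group_observations(obs, anomalous):
     """anomalous=False: I+ and I- treated as equivalent (pooled by hkl).
     anomalous=True: I+ and I- kept apart and merged separately."""
     groups = defaultdict(list)
     for h, k, l, sign, intensity in obs:
         key = (h, k, l, sign) if anomalous else (h, k, l)
         groups[key].append(intensity)
     return groups

 for anom in (False, True):
     merged = {key: sum(v) / len(v)
               for key, v in group_observations(obs, anom).items()}
     print("anomalous =", anom, "->", merged)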

Controlling Truncate & defining datasets

[Figure: Truncate & dataset definition]

The panel headed Convert to SFs & Wilson plot asks for an approximate number of residues in the asymmetric unit: this is used only to put the structure factors on a very rough absolute scale, so the value is not very important. This panel also defines how the columns will be labelled in the final merged MTZ file: either the dataset name or a specified string will be appended to each label, eg F_peak, SIGF_peak etc. in this example. For this reason, it is sensible to keep dataset names short.
The dataset names displayed here are read from the input MTZ file, and may be edited in the Define Output Datasets panel. If the datasets have not been defined correctly at an earlier stage, they may be changed here, including the option to combine everything into a single dataset. Maximum flexibility in defining datasets requires explicit definition of "Runs", activated with the button Override automatic definition of 'runs' to mark discontinuities in data (see above).

[Figure: Scaling protocol definition]

Scaling Protocol

The default options defined here are recommended in most cases.

[Figure: Excluding batches or datasets]

Excluding data

Following an initial run, it may be useful to exclude batches, either to remove individual bad batches or to remove batches from the end (you can exclude an open-ended range, eg from 1181 to 1999). Note that these are the actual batch numbers, not the batch serial numbers (ie they do not necessarily start from 1).

Other options

Other options are described under Advanced options below.

Program output: Scala

The Scala logfile is rather long, with many statistics, but some of them are only useful if there are severe problems. At the end of the logfile there is a summary table which contains the information useful for deposition and publication (View Annotated Log in Web Browser, or View Log File and click on "Show Summary"). Note that if you scale multiple datasets together, they are each merged (averaged) separately, so the logfile contains separate statistics for each dataset, as well as some correlation statistics between them. The statistics are best viewed as graphs (View Log Graphs or View Annotated Log in Web Browser) as illustrated below. Some other graphical output files (eg normal probability plots) are listed in the Output files as *.xmgr (see right): these are displayed with the program "xmgrace" if installed, or otherwise with loggraph.

Based on the output, you need to make some judgements about your data.

  • Are there bad batches (individual bad batches or ranges of batches)?
  • Was the radiation damage such that you should exclude the later parts?
  • What is the real resolution? Should you cut the high-resolution data?
  • Is there any apparent anomalous signal?
  • Is the outlier detection working well?
  • What is the overall quality of the dataset? How does it compare to other datasets for this project?

Analysis by Batch number (equivalent to time)

Scales & B-factor

The relative B-factor is principally a (very rough) correction for radiation damage. Data with B-factors < -10 should be treated with suspicion. Note that B is ill-determined with low-resolution data and perhaps should not be refined. There is nothing you can do to correct for severe radiation damage, unless you have designed your experiment specifically to exploit it (extrapolation to zero-dose needs every reflection to be measured at several well-spaced time intervals).

Rmerge etc

Are there any bad patches in the dataset? Examine the graph of Rmerge v. batch number.

[Figure: Left: something was horribly wrong at the beginning; in this case there was a poor orientation matrix at the start, fixed by re-running Mosflm. Centre: one bad batch. Right: steady decline in data quality, in this case from radiation damage.]

[Figure: A realistic cutoff is around 1.8Å]

Analysis by resolution

What is the real resolution? The best guide is the average signal/noise after averaging symmetry-related observations < <I>/σ(<I>) >, labelled Mn(I)/sd(Mn(I)) in the Scala table. A typical cut-off is ~2: weaker data may make a small difference in maximum-likelihood weighted refinement, but not much. It certainly will not give good experimental phases. Another indicator of the maximum resolution is the correlation coefficient on I between half-datasets, see below. Resolution cut-offs ought to take into account anisotropy of diffraction, which is not handled gracefully by current software.
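As a rough illustration of this cutoff test, here is a minimal Python sketch assuming per-shell mean intensities and their standard deviations are already available (all values invented):

 import numpy as np

 # Invented per-shell statistics: resolution d (Angstrom), mean merged
 # intensity and its estimated standard deviation.
 d = np.array([3.2, 2.8, 2.5, 2.2, 2.0, 1.9, 1.8, 1.7])
 i_mean = np.array([900., 600., 350., 180., 90., 45., 22., 10.])
 sig_i = np.array([30., 25., 20., 18., 15., 14., 13., 12.])

 signal = i_mean / sig_i              # Mn(I)/sd(Mn(I)) per shell
 usable = d[signal >= 2.0]            # shells passing the ~2 rule of thumb
 print("suggested high-resolution cutoff: %.2f A" % usable.min())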

[Figure: A good Rmerge for the strongest reflections]

Analysis against intensity

Rmerge is obviously always large for small intensities, but its value for the largest intensities should be in the range 0.01 to 0.04 for good data sets (0.033 in example on left). Larger values suggest that there are some worrying systematic errors.

Completeness & multiplicity

Datasets should be complete, as near to 100% as you can manage. Some loss of completeness can be tolerated in the outermost resolution bins. Dangers:

  • watch out for bad anomalous completeness (see example left). Note that in P1 you need a rotation of at least 180°+2θmax (better 360°) to get complete anomalous data.
  • Note that Mosflm may integrate into the corners of square detectors, leading to very incomplete data at the maximum resolution. You should apply a high-resolution limit (see example on right).


High multiplicity gives more accurate data and is essential for use of the weakest anomalous signals (eg sulphur), and also allows more reliable elimination of outliers, but note that Rmerge will tend to rise with increasing multiplicity even though <I> is more accurate. The multiplicity-weighted Rmeas (=Rrim) does not suffer this problem, and Rpim gives a measure of the accuracy of the average I.
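The relationship between these statistics can be made concrete. The following Python sketch computes Rmerge, Rmeas and Rpim from groups of symmetry-equivalent intensities, using the standard definitions (with the square-root factors noted in the reference at the end of this page); the input data are invented:

 import numpy as np

 def r_factors(groups):
     """Rmerge, Rmeas (= Rrim) and Rpim from lists of symmetry-equivalent
     intensities, one list per unique reflection."""
     num_merge = num_meas = num_pim = denom = 0.0
     for ints in groups:
         ints = np.asarray(ints, dtype=float)
         n = len(ints)
         if n < 2:
             continue                  # need multiplicity >= 2
         dev = np.abs(ints - ints.mean()).sum()
         num_merge += dev
         num_meas += np.sqrt(n / (n - 1.0)) * dev   # multiplicity-weighted
         num_pim += np.sqrt(1.0 / (n - 1.0)) * dev  # precision-indicating
         denom += ints.sum()
     return num_merge / denom, num_meas / denom, num_pim / denom

 # Toy example: two unique reflections, multiplicities 4 and 3.
 print(r_factors([[100., 104., 96., 100.], [50., 55., 45.]]))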

[Figure: Standard deviation of normalised deviation]

Analysis of Standard Deviations

Scala compares the observed scatter of symmetry-related observations around the mean with the estimated σ(I), and "corrects" the σ(I) by a multiplier and a function of <I>. On average, the observed scatter should equal the estimated standard deviation.

σ'²(I) = SdFac² {σ²(I) + SdB·<Ih> + (SdAdd·<Ih>)²}
where <Ih> is the average intensity of reflection h.

By default, the three parameters SdFac, SdB & SdAdd are determined automatically, with separate values for fulls & partials, and for each run. The plot on the right represents the distribution of normalised errors δ = (I-<Ih>)/σ(I) as a function of <Ih>. This distribution should have a standard deviation of 1.0, so the plot should be horizontal with a value = 1.0 everywhere. If it slopes upwards, then σ(I) is too small for large I, so SdAdd should be increased, or reduced if it slopes downwards.
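A minimal Python sketch of applying this correction, assuming the three parameters have already been determined (the values below are invented; the automatic refinement of SdFac, SdB and SdAdd inside Scala is not reproduced here):

 import numpy as np

 def corrected_sigma(sig_i, i_mean, sdfac=1.3, sdb=5.0, sdadd=0.02):
     """Apply sigma'^2(I) = SdFac^2 (sigma^2(I) + SdB*<Ih> + (SdAdd*<Ih>)^2).
     The parameter values here are invented for illustration."""
     var = sig_i ** 2 + sdb * i_mean + (sdadd * i_mean) ** 2
     return sdfac * np.sqrt(var)

 print(corrected_sigma(np.array([10., 20.]), np.array([500., 5000.])))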

[Figure: Correlations of anomalous differences between datasets]
[Figure: Correlations of dispersive differences between datasets]

Correlations within and between datasets

If you have multi-wavelength datasets, then the anomalous differences for each dataset should be correlated, as should the dispersive differences between them. Examination of these correlation coefficients as a function of resolution gives an indication of the reliability of the signal. In this 3-wavelength example, the correlations on anomalous differences decline with resolution, as expected, and look good to at least 3.5Å. Bizarrely, the correlation on dispersive differences goes down to zero at about 3.8Å resolution, then climbs to a high value again at high resolution, which is extremely suspicious: indeed, the dispersive differences did not give much useful phasing, though the anomalous differences did.

[Figure: Correlations of anomalous differences within dataset]
Within one dataset, an equivalent correlation analysis is done between random halves of the dataset: clearly this only works well with a reasonably high multiplicity (at least 4, preferably higher). The plot on the left is for the "peak" dataset from the same 3-wavelength set as the examples above. The anomalous correlation within the half-datasets (red) is always lower than that between the complete dataset and the "edge" dataset, reflecting the lower multiplicity in the half-sets. The centric data (blue) should have a correlation coefficient of zero, but the number of data is small so the error is large. The correlation coefficient on I (as opposed to ΔI, green) is another indicator of the maximum real resolution: if it is falling rapidly at the edge, perhaps a high resolution cutoff should be applied.
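A sketch of the half-dataset correlation idea in Python, with synthetic intensities (this is an illustration of the principle, not Scala's implementation):

 import numpy as np

 rng = np.random.default_rng(0)

 def half_dataset_cc(groups):
     """Correlation between mean intensities of two random halves of the
     observations of each unique reflection (needs good multiplicity)."""
     half1, half2 = [], []
     for ints in groups:
         ints = rng.permutation(np.asarray(ints, dtype=float))
         if len(ints) < 4:
             continue                  # too few observations to split
         n = len(ints) // 2
         half1.append(ints[:n].mean())
         half2.append(ints[n:].mean())
     return np.corrcoef(half1, half2)[0, 1]

 # Synthetic data: 200 reflections, multiplicity 6, 10% noise.
 true_i = rng.exponential(100.0, size=200)
 groups = [t + rng.normal(0.0, 10.0, size=6) for t in true_i]
 print("half-dataset CC on I: %.3f" % half_dataset_cc(groups))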

[Figure: Correlation ratio]
[Figure: Scatter plot of anomalous difference correlations]

Correlations of anomalous differences within a dataset are also analysed in the "correlation plot" ("..._correlplot.xmgr" in the View Files from Job pull-down). This is just a scatter plot of the two estimates of Δanom from the half-datasets. If they are correlated, the scatter plot is elongated along the diagonal (see left). The degree of elongation may be measured as the ratio of the RMS widths of the distribution along the diagonal and perpendicular to it. This "RMS correlation ratio" is plotted against resolution in the loggraph plots and, like the correlation coefficient, may be used to choose a resolution cutoff for locating heavy atoms.
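The RMS correlation ratio is straightforward to compute from the two half-dataset estimates of Δanom; a Python sketch with synthetic data:

 import numpy as np

 def rms_correlation_ratio(danom1, danom2):
     """Ratio of the RMS width of the (danom1, danom2) scatter along the
     diagonal to that perpendicular to it; elongation along the diagonal
     means the two half-dataset estimates of Danom agree."""
     x = np.asarray(danom1) - np.mean(danom1)
     y = np.asarray(danom2) - np.mean(danom2)
     along = (x + y) / np.sqrt(2.0)    # component along the diagonal
     perp = (x - y) / np.sqrt(2.0)     # component perpendicular to it
     return np.sqrt(np.mean(along ** 2) / np.mean(perp ** 2))

 rng = np.random.default_rng(1)
 signal = rng.normal(0.0, 5.0, 1000)   # common anomalous signal
 print("%.2f" % rms_correlation_ratio(signal + rng.normal(0., 2., 1000),
                                      signal + rng.normal(0., 2., 1000)))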

Normal Probability Analysis

[Figure: An example of a good dataset]

To view the normal probability plots, in "View Files from Job", click on <name>_normplot.xmgr (or <name>_anomplot.xmgr for the anomalous probability plot). These plot the normalised deviation δ = (I-<I>)/σ against what would be expected from a Gaussian (Normal) distribution: if the data truly followed a Gaussian distribution of errors with the estimated σ, all the points would lie on the diagonal, with most of the observations in the centre and fewer further from the centre along the diagonal. Real data always have more deviant observations than predicted from a Gaussian, shown as the points curving away from the diagonal at the top right & bottom left. The SdFac correction forces the slope to be 1 in the centre, but how well the error assumptions hold overall may be judged from (a) how close to the centre the points start to deviate from the diagonal, and (b) how similar the sub-sets of data (fulls, partials & different runs) are.
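A Python sketch of how such a plot is constructed, using Blom's approximation for the expected Gaussian order statistics (an illustration, not the program's code):

 import numpy as np
 from scipy.stats import norm

 def normal_probability_points(deltas):
     """Sorted observed normalised deviations vs expected Gaussian order
     statistics (Blom's approximation); points on the diagonal mean the
     sigmas are well estimated."""
     obs = np.sort(np.asarray(deltas, dtype=float))
     n = len(obs)
     expected = norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))
     return expected, obs

 rng = np.random.default_rng(2)
 exp_q, obs_q = normal_probability_points(rng.standard_normal(500))
 slope = np.polyfit(exp_q[100:400], obs_q[100:400], 1)[0]
 print("central slope ~ %.2f" % slope)   # ~1.0 for well-estimated sigmas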

[Figure: Poor]
[Figure: Not so good but tolerable]
If the plot is poor, there is not much you can do about it, except treat it as a general indicator that the quality of this dataset may not be the best.

[Figure: Weak but significant anomalous differences]

In the case of the anomalous probability plot, δ = (I+ - I-)/σ and a slope > 1 in the centre indicates that the measured anomalous differences are greater than would be expected from the standard deviations.
However, note the example on the right, where there appears to be no significant difference, but nevertheless there was useful phasing information, in combination with other datasets.

Program output: Truncate

The program "truncate" (or "ctruncate") is normally run immediately after Scala, though it may be run as a separate task (Utilities->Convert Intensities to SFs). It converts intensities to structure amplitudes |F| using the procedure of French & Wilson (1978). For perfect data, |F| = √I, but in the presence of errors the best estimate of F for small intensities is larger than √I. Truncate estimates the "best" F based on the probability distribution of I.
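A numerical sketch of the idea in Python, combining a Gaussian measurement error with an acentric Wilson prior and integrating on a grid (a simplification of the French & Wilson treatment, with invented values):

 import numpy as np

 def best_f(i_obs, sigma, mean_i):
     """Posterior estimate of F = sqrt(J) from a noisy intensity:
     Gaussian error model times an acentric Wilson prior
     p(J) ~ exp(-J/mean_i), integrated on a grid of true intensities J."""
     j = np.linspace(0.0, i_obs + 10.0 * sigma, 4000)
     post = np.exp(-0.5 * ((i_obs - j) / sigma) ** 2 - j / mean_i)
     post /= post.sum()                # normalise on the uniform grid
     return float(np.sum(np.sqrt(j) * post))

 # Even a slightly negative measured intensity gives a positive F.
 print("%.2f" % best_f(-5.0, sigma=10.0, mean_i=100.0))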

It also produces intensity statistics plots which help estimate data quality: these are surprisingly predictable for all crystals - see theory section. However, this is not so in cases where there is more than one molecule in the crystal asymmetric unit related by a simple translation, referred to as a non-crystallographic translation (NCT). This can be detected by inspecting a native Patterson map, and means that whole classes of reflections become weak (eg a pseudo-translation of ~0.5,0,0 will lead to reflections with h odd being weak compared with those with h even), which will distort the intensity distributions.

The first indicator to inspect is the Wilson plot, which is not affected by any NC translation. This shows the fall-off of mean intensity with resolution. For a protein it has a very predictable form: there is a dip at about 5Å resolution, then the plot should be fairly linear from 4Å to the diffraction limit. If this is not so, there may be a problem with integration. A plot which turns up at the end shows that the resolution limit should be reduced.
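A rough Python sketch of reading a B-factor off the linear region of a Wilson plot (synthetic data; the 4Å boundary and the values are illustrative):

 import numpy as np

 def wilson_b(d_spacings, intensities, dmin=4.0):
     """Fit ln<I> against (sin theta / lambda)^2 = 1/(4 d^2) in the
     shells beyond dmin; the slope of the line is -2B."""
     d = np.asarray(d_spacings, dtype=float)
     i = np.asarray(intensities, dtype=float)
     sel = d <= dmin                   # linear region of the Wilson plot
     s2 = 1.0 / (4.0 * d[sel] ** 2)
     slope = np.polyfit(s2, np.log(i[sel]), 1)[0]
     return -slope / 2.0

 d = np.array([6.0, 5.0, 4.0, 3.2, 2.8, 2.4, 2.0, 1.8])
 i_mean = 1000.0 * np.exp(-2.0 * 25.0 / (4.0 * d ** 2))  # synthetic B = 25
 print("Wilson B ~ %.1f A^2" % wilson_b(d, i_mean))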

The second useful indicator is the anisotropy plot (again not affected by NCT). This plots <intensity> for reflections which lie near the reciprocal lattice axes a*, b* and c*. If the plots do not overlap, and the mean along one axis is significantly weaker than along the others, you have anisotropic data. Current crystallographic software does not generally handle seriously anisotropic data gracefully, and in the weak direction there will be less information in the data. The analyses described below need to be done after correcting the data for this anisotropy.

Two different types of distribution can be used to detect twinning. In a merohedral or pseudo-merohedral twin, each observed reflection is actually the sum of two different reflections related by some "twin" symmetry operator. The moments of the observed normalised amplitudes E are distorted to lie nearer to 1.0 than expected (E is the normalised amplitude, with <E²> = 1.0 at all resolutions), see plots on right.
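A Python sketch of the moment test on simulated acentric intensities (a single resolution shell for simplicity; for untwinned acentric data <|E|> ≈ 0.886 and <E⁴> = 2.0, moving towards 0.94 and 1.5 for a perfect twin):

 import numpy as np

 def e_moments(intensities):
     """<|E|> and <E^4> from intensities in one resolution shell
     (E^2 = I/<I>, so <E^2> = 1 by construction)."""
     z = np.asarray(intensities, dtype=float)
     z = z / z.mean()                  # normalised intensity Z = E^2
     return np.sqrt(z).mean(), (z ** 2).mean()

 rng = np.random.default_rng(3)
 untwinned = rng.exponential(1.0, 100000)          # Wilson acentric
 twinned = 0.5 * (untwinned + rng.exponential(1.0, 100000))
 for label, data in (("untwinned", untwinned), ("perfect twin", twinned)):
     print(label, "<|E|> = %.3f, <E^4> = %.3f" % e_moments(data))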

The cumulative intensity plot (left) is also distorted by twinning: there are apparently too few weak or strong intensities, since if two random intensities are added, it is unlikely that both are weak or both strong. The effect may be seen as a sigmoidal cumulative intensity plot (left, green line lying between the theoretical red line for untwinned data and the theoretical blue line for a perfect twin). This plots the fraction of reflections with intensities less than Z, where Z is the intensity as a fraction of the mean intensity at that resolution.
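A Python sketch comparing an observed cumulative intensity distribution with the theoretical acentric curves for untwinned data, 1 - exp(-Z), and for a perfect twin, 1 - (1+2Z)exp(-2Z):

 import numpy as np

 def cumulative_intensity(intensities, z):
     """Observed N(Z): fraction of reflections with I < Z * <I>."""
     zz = np.asarray(intensities, dtype=float)
     zz = zz / zz.mean()
     return np.array([(zz < zi).mean() for zi in z])

 z = np.linspace(0.0, 1.0, 6)
 untwinned = 1.0 - np.exp(-z)                        # acentric, no twin
 twin = 1.0 - (1.0 + 2.0 * z) * np.exp(-2.0 * z)     # perfect twin

 rng = np.random.default_rng(4)
 obs = cumulative_intensity(rng.exponential(1.0, 50000), z)
 for row in zip(z, obs, untwinned, twin):
     print("Z=%.1f obs=%.3f untwinned=%.3f twin=%.3f" % row)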

The phenomenon of too few weak intensities can also arise from the spots not being properly resolved on the images: weak reflections are then over-estimated due to the tails of neighbouring strong reflections running into them.

Intensity statistics are distorted for very anisotropic data and also if there is translational NCS, since they compare intensities to the mean intensity, which is assumed to vary only with resolution. In that case, the cumulative intensity plot is usually above the theoretical line, and this may obscure twinning.


Advanced options

Corner corrections


[Figure: Image of corner corrections]

CCD detectors underestimate the intensities of spots close to the edges and particularly in the corners of the tiles, due to the point spread function from the optical taper. This is a significant problem for 3x3 tiled detectors, as the corners lie in critical parts of the diffraction pattern. The spot intensities may be corrected using a calibrated correction table for the individual detector (different for each detector), using the pixel coordinates in the HKLIN file. The table is given to Scala as an ADSC image format file, and activated by the CORNERCORRECT command. (Acknowledgements for this correction are due to: Andy Arvai, Xuong Nguyen-huu, Chris Nielsen, Raimond Ravelli, Gordon Leonard, Sean McSweeney, Sandor Brockhauser, and Andrew McCarthy)

[Figure: Distribution of outliers on detector]

Note that at present Scala has no way of knowing which detector was used, so it is up to the user to provide the correct file: correction files should be available from the synchrotron beamlines, or from the detector manufacturers.

The effect of the error can sometimes be seen in the distribution of outliers on the detector, see the example on the left (*_rogueplot.xmgr in the Output files list): this example also shows outliers along the rotation axis, possibly due to incorrect masking of the beam-stop shadow in the integration process.

Program documentation

The latest version of the documentation is available here. This provides information on program keywords which may be used from the command line.

This page describes Scala version 3.3.6 (CCP4 version 6.1.0).

References

Evans, P.R. (2006). Scaling and assessment of data quality. Acta Cryst. D62, 72-82.
Note that the definitions of Rmeas and Rpim in this paper are missing a square root on the (n/(n-1)) and (1/(n-1)) factors.

French, S. & Wilson, K. (1978). On the treatment of negative intensity observations. Acta Cryst. A34, 517-525.
