2 Design and Redesign


This step may involve identifying specific sources of existing data and of outside expertise. It may be necessary to identify people or institutions controlling access to potential research subjects or other resources necessary for the project. Some instruments may be proprietary and specific arrangements may have to be made to use them.


A simple sampling method may be inadequate to accurately represent a particular universe. Special procedures may be needed to sample difficult-to-reach portions of the universe. This step should also include estimates of statistical power and consideration of the analysis methods demanded by the chosen sampling method.
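The power estimates mentioned above can be roughed out before any data are collected. A minimal sketch, using the standard normal approximation for a two-sample proportion comparison; the alpha level, effect size, and group sizes below are illustrative, not prescriptions:

```python
import math

def norm_cdf(z: float) -> float:
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_proportions(p1: float, p2: float, n_per_group: int,
                          z_alpha: float = 1.96) -> float:
    """Approximate power of a two-sided test comparing two proportions,
    via the normal approximation (z_alpha = 1.96 for alpha = 0.05)."""
    se = math.sqrt(p1 * (1 - p1) / n_per_group +
                   p2 * (1 - p2) / n_per_group)
    z = abs(p1 - p2) / se
    # The far-tail term of the two-sided test is negligible and omitted.
    return norm_cdf(z - z_alpha)
```

Running such a calculation across candidate sample sizes shows quickly whether a design can detect the effects the study cares about, or whether the sampling plan needs revisiting.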


Concepts to be measured may have been chosen, but in this step the specific form of the measurement tool is designed. For surveys this includes specifying categories and codes, the wording of questions, and the flow of the questionnaire. Other concepts may need to be measured through analysis of physical samples, as with biomarkers, or through the action of some mechanical or electronic instrument, as with measuring physical location. Custom sensors may need to be designed.

Data may also be obtained by coding something observed, either in real time or from recorded material. Observational settings may be structured, as in interviews or focus groups, or unstructured. Data may also be obtained through mining public online sources. This may involve software design.
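The categories and codes that instrument design must specify can themselves be captured as data. A minimal sketch of one question's specification; the question, field names, and code values are illustrative assumptions, not drawn from any particular standard:

```python
# Illustrative specification of a single survey question's codes.
EMPLOYMENT_STATUS = {
    "name": "employment_status",
    "text": "Which of these best describes your current situation?",
    "codes": {
        1: "Employed full-time",
        2: "Employed part-time",
        3: "Unemployed",
        4: "Retired",
        97: "Refused",
        98: "Don't know",
    },
}

def is_valid(code: int, question: dict) -> bool:
    # Validate a recorded response against the question's code list.
    return code in question["codes"]
```

Specifying codes in machine-readable form like this lets the same definition drive the questionnaire, real-time validation at the collection point, and later documentation.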

Careful consideration should be given to any methods to improve data quality at the collection point rather than in post-processing.


The OECD defines a data element as “a unit of data for which the definition, identification, representation, and permissible values are specified by means of a set of attributes” (source: http://stats.oecd.org/glossary/detail.asp?ID=538). Defining data elements for the study goes hand in hand with designing collection instruments. The ultimately desired data elements, though, may be measured indirectly. A rate, for example, may be measured as a distance and a time. Body Mass Index may be measured as a weight and a height. More complex scale scores may need to be calculated. This step may identify what post-processing is necessary to compute the final set of data elements.
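The examples above can be made concrete. A minimal sketch of post-processing functions that derive the final data elements from the directly measured ones (units are assumptions for illustration):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    # Body Mass Index derived from two directly measured elements.
    return weight_kg / height_m ** 2

def rate(distance_m: float, time_s: float) -> float:
    # A rate derived from a measured distance and a measured time.
    return distance_m / time_s
```

Writing the derivations down this explicitly, including the units of each input, is part of defining the derived element itself.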

Designing data elements to be reused across waves or phases of a longitudinal study will be important. The use of versioned, persistent identifiers for new data elements, as implicit in DDI, is an important part of this step. Choosing data elements that have been used in other studies is also desirable, enhancing the potential for reuse.
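One way to make versioned, persistent identifiers concrete is to treat the identifier as structured data rather than a bare string. A minimal sketch; the field names and the colon-separated rendering are illustrative assumptions, not the DDI URN syntax:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataElementId:
    """Illustrative versioned identifier for a data element."""
    agency: str    # who maintains the definition
    element: str   # stable name of the element
    version: int   # incremented when the definition changes

    def render(self) -> str:
        return f"{self.agency}:{self.element}:{self.version}"
```

Because the object is immutable and the version is part of its identity, a changed definition necessarily yields a new identifier, which is what makes reuse across waves traceable.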


This step identifies the exact procedures to generate any derived measures as well as any steps needed to improve data quality. These processes become an important part of the definition of a data element. When they change, the meaning of the data element may change. Careful documentation here is an important part of documenting data quality.

Software may be purchased or developed to facilitate automated cleaning. Use of visualization tools may be planned. Staff training might need to be scheduled.
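An automated cleaning step often begins with simple plausibility checks. A minimal sketch of a range check for recorded adult heights; the thresholds are illustrative assumptions, and flagged values are routed to review rather than silently deleted:

```python
def clean_heights(values, low=1.2, high=2.2):
    """Split recorded adult heights (metres) into plausible and flagged.
    Thresholds are illustrative; flagged values go to manual review."""
    kept, flagged = [], []
    for v in values:
        (kept if low <= v <= high else flagged).append(v)
    return kept, flagged
```

Keeping checks like this small and explicit makes them easy to document, which matters because the cleaning rules become part of the data element's definition.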


It may be desirable, or required, to specify an analysis plan in advance of collecting data. This does not preclude deviating from the plan, but carrying out a prespecified plan may strengthen the impact of certain types of results (see step 8.4).


More formal arrangements for building the team are made here. Recruitment, hiring, and training may take place. An organizational structure may be developed. Some thought should be given to the appropriate leadership style, and agreement on project roles should be developed.


This step may involve arranging for space and equipment for the team, communications infrastructure, travel arrangements, and more. Computing infrastructure may need particular attention – e.g., hardware, software, networking, security, and storage. Thought should be given to where bottlenecks might occur during production phases.

Infrastructure needs will extend beyond the collection, analysis, and publication phases to the archival life of the data.