8 Research and Publish


Within a longitudinal study, once data are ready for analysis, it may be desirable to search again for existing data and research, extending any search done at step 1.2. Longitudinal studies based on meta-analyses, in which no new data are collected, are also possible.


Data from an ongoing study may need to be extracted, possibly from multiple sources. Access controls and identity management may come into play, particularly in multi-institutional studies.


Data may need to be refactored from their form in production systems to forms usable by data analysts. Different analyses may require new variables. A spatial analysis, for example, may require geocoding the data. Data structures may have to be changed. To fit the requirements of analysis tools, a “tall skinny” form of a table containing one row for each combination of subject and time period might need to be transformed into a “short wide” table with one row for each subject and separate columns for each time period.
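The reshaping described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the field names (`subject`, `period`, `score`) and the function `long_to_wide` are hypothetical and would be replaced by a study's own variables and tooling.

```python
from collections import defaultdict

# Hypothetical "tall skinny" records: one row per (subject, period).
long_rows = [
    {"subject": "s1", "period": 1, "score": 10},
    {"subject": "s1", "period": 2, "score": 12},
    {"subject": "s2", "period": 1, "score": 7},
    {"subject": "s2", "period": 2, "score": 9},
]


def long_to_wide(rows, id_key, time_key, value_key):
    """Return one dict per subject with a separate column per time period."""
    periods = sorted({row[time_key] for row in rows})
    by_subject = defaultdict(dict)
    for row in rows:
        by_subject[row[id_key]][row[time_key]] = row[value_key]

    wide = []
    for subject in sorted(by_subject):
        record = {id_key: subject}
        for period in periods:
            # A missing subject/period combination becomes None
            # instead of being silently dropped.
            record[f"{value_key}_{period}"] = by_subject[subject].get(period)
        wide.append(record)
    return wide


wide_rows = long_to_wide(long_rows, "subject", "period", "score")
for record in wide_rows:
    print(record)
```

Keeping missing combinations as explicit empty values, rather than omitting them, preserves the fact that a subject was not observed in a given period.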


Analysis of the data may involve running a predetermined set of programs on the data, or might involve a complex iterative process. In either case, good practice would involve specifying an analysis plan before data collection begins and recording deviations from that plan as analysis proceeds. This matters most when many statistical significance tests could be run on a set of data. Finding significance in a small set of prespecified tests must be interpreted very differently from searching a large set of possible tests and reporting whichever results turn out to be statistically significant.
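A short calculation may help show why the number of tests matters. The sketch below assumes the tests are independent and every null hypothesis is true, which real analyses rarely satisfy exactly. The function names are illustrative, and the Bonferroni adjustment is only one of several possible corrections.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests,
    each run at significance level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


for n_tests in (1, 5, 20, 100):
    rate = familywise_error_rate(0.05, n_tests)
    per_test = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:>3} tests: familywise rate {rate:.3f}, "
          f"Bonferroni per-test level {per_test:.5f}")
```

Under these assumptions, 20 tests at the 0.05 level give roughly a 64 percent chance of at least one spurious "significant" result, which is why a prespecified plan changes how a finding should be read.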

Replication, the “gold standard” in science, requires a detailed description of the analysis process. Best practice might involve including the analysis scripts, along with related metadata (version of the software used to run each script, operating system and hardware used, etc.), in the extended metadata to be archived.
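As one possible way of capturing such metadata, an analysis script could record its own execution environment each time it runs. The sketch below uses Python's standard `platform` module. The function name `capture_run_metadata`, the script name `analysis.py`, and the choice of fields are assumptions for illustration, not a prescribed metadata schema.

```python
import json
import platform
from datetime import datetime, timezone


def capture_run_metadata(script_name):
    """Collect basic facts about the environment in which a script is run."""
    return {
        "script": script_name,
        "run_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
    }


# "analysis.py" is a hypothetical script name.
metadata = capture_run_metadata("analysis.py")
metadata_json = json.dumps(metadata, indent=2, sort_keys=True)
print(metadata_json)
```

The resulting JSON could be archived alongside the script. A fuller record would probably also list the versions of any statistical packages the script loads.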


In addition to text, publications may include tables and graphics. The methods used to produce those tables and graphics should be included in the metadata associated with the data used for the publication. When the data analyzed are an extract from an evolving production system, good practice would involve preserving either the extract itself or some method of recreating exactly that set of data. Best practice might involve including in a publication both a persistent identifier usable for locating the data (see, for example, The DOI® System) and a reproducible identifier for the content of a set of data, such as the Universal Numeric Fingerprint (see Altman and King 2007). The latter helps ensure that attempts at replication really use the same data, even if the data are represented in a different software format.
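The idea of a format-independent content identifier can be illustrated with a simplified sketch. The code below is not the Universal Numeric Fingerprint algorithm. It is a hypothetical stand-in that normalizes values to a canonical text form (numbers rounded to an assumed seven significant digits) and hashes them with SHA-256, so that the same content yields the same digest regardless of column order or whether a number was stored as an integer or a float.

```python
import hashlib


def canonical(value, digits=7):
    """Map a value to a canonical string (illustrative rules, not the UNF rules)."""
    if value is None:
        return "NA"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        # Scientific notation with a fixed number of significant digits,
        # so 30 and 30.0 are treated as the same content.
        return format(float(value), f".{digits - 1}e")
    return str(value)


def content_fingerprint(columns, digits=7):
    """Hash a dataset given as {column_name: [values...]}.

    Columns are processed in sorted name order, so column order in the
    source file does not affect the result. Row order still matters.
    """
    digest = hashlib.sha256()
    for name in sorted(columns):
        digest.update(name.encode("utf-8") + b"\x00")
        for value in columns[name]:
            digest.update(canonical(value, digits).encode("utf-8") + b"\x00")
        digest.update(b"\x01")  # marks the end of a column
    return digest.hexdigest()


# Hypothetical data: the same content loaded from two different file formats.
from_csv = {"id": ["a", "b"], "age": [30, 41]}
from_stats_package = {"age": [30.0, 41.0], "id": ["a", "b"]}

print(content_fingerprint(from_csv) == content_fingerprint(from_stats_package))
```

A published fingerprint of this kind lets a replicator confirm that a converted copy of the data still carries the same content. For archival use, an established scheme such as the one cited above would be preferable to an ad hoc one.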


When there are confidentiality constraints on the underlying data, tables and graphics derived from those data must be evaluated for disclosure risk. Detailed statistical models may also require evaluation.
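One common screening step, sketched below, is to flag table cells with small counts before release. The threshold of 5 and the decision to ignore empty cells are assumptions made for illustration. Actual rules depend on the confidentiality policy that governs the data, and a small-cell check alone does not amount to a full disclosure review.

```python
def flag_small_cells(table, min_count=5):
    """Return (row_label, column_label, count) for nonzero cells below min_count.

    `table` is a nested dict: {row_label: {column_label: count}}.
    """
    flagged = []
    for row_label, row in table.items():
        for column_label, count in row.items():
            if 0 < count < min_count:
                flagged.append((row_label, column_label, count))
    return flagged


# Hypothetical cross-tabulation of residence by survey response.
crosstab = {
    "urban": {"yes": 120, "no": 3},
    "rural": {"yes": 0, "no": 48},
}

for row_label, column_label, count in flag_small_cells(crosstab):
    print(f"Review cell ({row_label}, {column_label}): count = {count}")
```

Flagged cells would then be reviewed and, if necessary, suppressed or combined with neighboring categories before the table is published.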


Publication of research results may require negotiation of arrangements for archiving the data and metadata used for the publication. In some cases, a publisher might require a copy of the data for its archive. Careful negotiation of access rights would be prudent. Metadata for the dataset used in the publication should be updated to include a citation of the publication.