One of the major work activities for the VERS program is focused on Structured Data. Two PROV staff members, Andrew Waugh and David Fowler, have undertaken some initial thinking, which is documented (in part) below.
For PROV, the long-term goal of the Structured Data work program is to prepare standards, specifications, and guidelines on managing structured data within the Victorian government to ensure its continued accessibility for as long as it is required. For the purposes of this work program, PROV does not draw any distinction between data and records.
Structured data sources will include the following:
- Business applications based on relational database technology
- Data sets (e.g. scientific data sets)
- GIS data
The initial phase of the work program will address a number of issues, including:
What portion of the data within a structured data source should be preserved
This issue is related to the distinction between data and records. Many archival discussions (e.g. see the ICA Functional Requirements for Business Systems) assume that not all the data in a structured data source will be part of a (permanent) record. Consequently, the first task facing the records manager is to identify the data that forms part of the record. This data may be distributed across multiple tables in the structured data source.
The question is: what should PROV consider to be the record to be preserved? The options are: transactions, a redacted version of the data source, or the whole data source. The discussion of this issue should canvass the criteria for determining which approach to take (it may be that different data sources need different approaches).
Transactions. Some very early literature (e.g. from the researchers who developed the Pittsburgh model) suggested that the data in the data source is not the record; the record is actually the transactions against this data source. These transactions are i) the updates to information in the data source, and ii) the reports generated from the data source (which may only be displayed on screen) on which decisions were made. This approach would cause practical problems:
- the difficulty in compiling update transactions to give a view of the state of a particular entity at a given point in time
- few data sources capture this information (particularly ad-hoc queries presented on screen).
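To illustrate the first of these problems, here is a minimal sketch of what compiling update transactions into a point-in-time view involves. The log layout, the `licence-17` entity and the `state_at` function are all hypothetical, invented for illustration; a real business application would have a far larger and messier transaction log.

```python
from datetime import datetime

# Hypothetical update log: (timestamp, entity id, field, new value).
TRANSACTIONS = [
    (datetime(2010, 1, 5), "licence-17", "holder", "A. Smith"),
    (datetime(2010, 1, 5), "licence-17", "status", "active"),
    (datetime(2011, 6, 1), "licence-17", "holder", "B. Jones"),
    (datetime(2012, 3, 9), "licence-17", "status", "cancelled"),
]


def state_at(transactions, entity_id, as_of):
    """Replay updates in time order to rebuild one entity's state at `as_of`."""
    state = {}
    for when, entity, field, value in sorted(transactions, key=lambda t: t[0]):
        # Only updates to this entity made on or before the cut-off count.
        if entity == entity_id and when <= as_of:
            state[field] = value
    return state
```

Calling `state_at(TRANSACTIONS, "licence-17", datetime(2011, 12, 31))` returns the holder as B. Jones and the status as active. Even in this toy case, every query has to replay the whole log, and the result is only as complete as the log itself.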
An alternative view is that the data source should be designed to capture the history of data over time.
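One common way of designing a data source to capture history is to keep every version of a row along with the period for which it was valid. The sketch below shows the idea using Python's built-in `sqlite3` module; the `licence_history` table, its columns and the sample rows are assumptions made up for this example, not a PROV specification.

```python
import sqlite3


def build_history_db():
    """Create an in-memory database holding a hypothetical history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE licence_history ("
        " licence_id TEXT, holder TEXT,"
        " valid_from TEXT, valid_to TEXT)"  # valid_to NULL means current row
    )
    conn.executemany(
        "INSERT INTO licence_history VALUES (?, ?, ?, ?)",
        [
            ("licence-17", "A. Smith", "2010-01-05", "2011-06-01"),
            ("licence-17", "B. Jones", "2011-06-01", None),
        ],
    )
    return conn


def holder_at(conn, licence_id, as_of):
    """Return the holder on the ISO date `as_of`, or None if no row applies."""
    row = conn.execute(
        "SELECT holder FROM licence_history"
        " WHERE licence_id = ? AND valid_from <= ?"
        " AND (valid_to IS NULL OR valid_to > ?)",
        (licence_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None
```

With this design, the state at a given date is a single query rather than a replay of every update, but it only works if the application was built this way from the start.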
Redacted Version of Data Source. This seems to be the common archival view: that only some data in a data source will need to be treated as a record, and that the data source should be redacted to only preserve this data. The benefits of this approach are:
- that only some data will need to be managed as a record (e.g. keeping a history of the data over time), which has benefits for both the agency and PROV
- that some applications require redaction of the preserved records (e.g. de-identified health data).
The disadvantages are that:
- redacting the data in the data source is an expensive undertaking, so insisting on this approach creates barriers to transfer
- redacting the data source in this way creates a risk of damaging the integrity of the data.
Some options for capturing the data as records (e.g. capturing a report) naturally lend themselves to redacting data.
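The integrity risk can be made concrete with a small sketch. The two tables, their field names and both helper functions below are hypothetical, chosen only to show the difference between removing a field (as in de-identification) and removing rows that other tables still refer to.

```python
# Hypothetical two-table data source (illustrative values only).
PATIENTS = [
    {"patient_id": 1, "name": "A. Smith", "birth_year": 1950},
    {"patient_id": 2, "name": "B. Jones", "birth_year": 1972},
]
ADMISSIONS = [
    {"admission_id": 10, "patient_id": 1, "diagnosis": "fracture"},
    {"admission_id": 11, "patient_id": 2, "diagnosis": "asthma"},
]


def redact_fields(rows, drop):
    """Return copies of the rows with the named fields removed."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


def dangling_references(children, parents, key):
    """List child rows whose key no longer matches any parent row."""
    known = {parent[key] for parent in parents}
    return [child for child in children if child[key] not in known]
```

Dropping the `name` field with `redact_fields(PATIENTS, {"name"})` leaves every admission linked to a patient, so `dangling_references` returns an empty list. Removing a whole patient row instead (for example, keeping only `PATIENTS[:1]`) leaves admission 11 pointing at a patient that no longer exists, which is the kind of damage a redaction process would need to check for.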
Capture of all the Data Source. This approach captures a representation of all the data in the data source. This is the simplest, least risky approach, and may be the least expensive approach. On the other hand:
- It may have security/privacy implications.
- It requires more time to document.
- Future researchers may have greater difficulty in accessing the information.
This is the first of a number of posts on this topic.