As mentioned in the previous post, PROV staff David Fowler and Andrew Waugh have been thinking about structured data. One of the avenues explored (a quick walk down, rather than a detailed inspection) is the range of options for capturing structured data in a long-term format. The issue is how the data in a data source should be represented when archived (either in a permanent repository, such as PROV, or in a temporary repository in an agency).
The options identified so far are: a machine-processible, low-level representation; a single report; or individual records.
Machine Processible. This approach captures a representation of the underlying structured data source (typically multiple tables) – a good example is SIARD.
- This approach gives the richest and most flexible data for subsequent researchers as they can easily mine the data or combine it with other data.
- It may be more work than the other approaches, and researchers will find the data more difficult to access (as they will need to load it into a database and develop queries).
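As a rough illustration of what a table-level capture involves, the sketch below exports every table in a small SQLite database as its schema plus its rows in CSV. This is not SIARD, which defines its own packaging and metadata; it is a simplified stand-in, and the `applicant` and `licence` tables, the sample data, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3


def build_sample_db():
    """Create a tiny in-memory database (hypothetical tables and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE applicant (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE licence (
            id INTEGER PRIMARY KEY,
            applicant_id INTEGER REFERENCES applicant(id),
            issued TEXT
        );
        INSERT INTO applicant VALUES (1, 'A. Example'), (2, 'B. Sample');
        INSERT INTO licence VALUES
            (10, 1, '2001-03-04'), (11, 1, '2005-06-07'), (12, 2, '2010-01-02');
        """
    )
    return conn


def export_tables(conn):
    """Return {table: {"schema": CREATE statement, "csv": rows as CSV text}}."""
    export = {}
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for name, create_sql in tables:
        # Table names come from the database's own catalogue, so they are
        # only quoted here, not validated.
        cursor = conn.execute(f'SELECT * FROM "{name}" ORDER BY 1')
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow([column[0] for column in cursor.description])  # header
        writer.writerows(cursor)
        export[name] = {"schema": create_sql, "csv": buffer.getvalue()}
    return export


if __name__ == "__main__":
    for table, parts in export_tables(build_sample_db()).items():
        print(table)
        print(parts["schema"])
        print(parts["csv"])
```

Nothing is lost in such an export, but a researcher has to reload the tables and join `licence` back to `applicant` before the data tells a story, which is the access cost noted above.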
Report. This approach is to generate a report that contains the records in the structured data source.
- The advantages of this approach are: i) the system may already be capable of generating this report, in which case the cost would be negligible; ii) systems often have the capability to generate ad-hoc reports and, in this case, the cost of extracting the records would be relatively small; iii) a report is essentially self-documenting, with little need to document the underlying data source; and iv) the report could be easily used by researchers into the future.
- The main disadvantages are: i) it is difficult to search for records of interest; ii) there is a lack of flexibility in using the data in the future; iii) metadata that would assist in ensuring the integrity and authenticity of the record may be lost; iv) the report is potentially very large; and v) for some (many?) data sources the report would not adequately reflect the original records.
Individual Records. This approach is similar to the Report approach above, but the report is broken up into individual ‘files’ and ‘records’ that are separately managed in PROV’s archival control model.
- For records that can be broken up in this way, the advantages are: i) casual researchers can easily find and use files/records, and the presentation echoes traditional recordkeeping; and ii) it minimises the size of individual records.
- The disadvantages are: i) it makes it extremely difficult to re-use the data source as a whole; ii) it would be more difficult for the agency to generate than a single report, and more time-consuming for PROV to process; and iii) some (many?) data sources could not be broken up into files/records without losing significant information.
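The difference between the Report and Individual Records approaches can be sketched in a few lines. The example below assumes the data source has already been flattened into rows; the `file_id`, `date` and `event` fields and the sample values are hypothetical, and PROV's archival control model is not modelled here beyond a simple file-to-records grouping.

```python
from collections import defaultdict

# Hypothetical rows extracted from a data source.
ROWS = [
    {"file_id": "LIC-1", "date": "2001-03-04", "event": "Licence issued"},
    {"file_id": "LIC-2", "date": "2010-01-02", "event": "Licence issued"},
    {"file_id": "LIC-1", "date": "2005-06-07", "event": "Licence renewed"},
]


def sort_key(row):
    """Order rows by file, then by date within each file."""
    return (row["file_id"], row["date"])


def single_report(rows):
    """Report approach: one document that lists every row."""
    lines = [
        f"{row['file_id']}  {row['date']}  {row['event']}"
        for row in sorted(rows, key=sort_key)
    ]
    return "\n".join(lines)


def individual_records(rows):
    """Individual Records approach: one 'file' per file_id, one 'record' per row."""
    files = defaultdict(list)
    for row in sorted(rows, key=sort_key):
        files[row["file_id"]].append(f"{row['date']}  {row['event']}")
    return dict(files)


if __name__ == "__main__":
    print(single_report(ROWS))
    for file_id, records in individual_records(ROWS).items():
        print(file_id, records)
```

The report is a single object that grows with the data source, while the split version yields many small objects that are easy to browse one at a time but awkward to reassemble into the whole.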