Database Design for the XAS Data Library¶
Important note: this section is currently far out-of-date and needs attention.
This page is a Work-in-Progress Proposal for how to build a Next Generation XAFS Data Library The main idea is to facilitate the exchange XAFS specra, particularly on model compounds, such as found in Model Compound Libraries 1, 2, 3, 4 We will create a mechanism to store multiple XAFS spectra in a manner that can be used within dedicated applications and embedded into existing data processing software with minimal effort.
Motivation¶
Storing and Exchanging XAFS Data is an common need for everyone using XAFS. In particular, retrieving data on Standards or Model Compounds is a continuing need for analysis. Additionally, data taken on samples at different facilities and beamlines need to be compared and analyzed together. At this point, there is no commonly accepted data format for XAFS data. There have been a few attempts 5
Most beamline data collection software, and most processing and analysis software use some variation of ASCII Column Data Files, in which data is written as delimited text in a text file with each row representing a datum of stepped scan in energy, with a header of some sort giving some idea of the files contents. While such files may be easily read by humans, they may not be so easily to read by machine, as differences in file formatting conventions mean that there is no standardized way to specify header information or the contents of the various columns in the data files.
Thus, while being human readable, ASCII Column Data Files are not always easily interpreted without first-hand knowledge of how the files were produced. As an important case in point, the data files in the Farrel Lytle XAFS Spectra Database use a few different formats in which energy is stored as monochromator steps, with the needed information (steps / degree) to decode these values put in an unmarked place in the header. In addition, it is rare to have much “meta-data” about sample or measurements conditions provided in the data file. When they are included, they are usually as short, cryptic notes difficult to decipher a few months after the experiment.
In addition, ASCII Column Data Files offer no standard way to hold multiple spectra. Yet, sharing and comparing multiple spectra is part of the need when exchanging data. In order to facilitate the sharing data between beamlines and the exchange (and recognition) of high-quality data on standards, it is useful to have Data Libraries or Repositories which a scientist can use at will and exchange with others. A key point is that not only should data be well-formatted and vetted for quality, but should also be easily be viewed as a part of Suite of Spectra.
A Data Library is often envisioned as a single centralized Library (see http://xafs.org/Databases) which is meant to house Standard Data with some assurance of quality and the expectation that the data is to be shared with the entire community. In contrast, many researchers keep their own set of data closely guarded and do not want to share their data. Another model for a Data Library is a set of spectra being shared between collaborators, but then possibly allowed to have a wider distribution (say, after some paper has been published). I am considering all of these as legitimate use cases, and am more interested here in creating a format that can be used for all these cases.
Implementation Overview¶
With the motivation from the previous section, the goals for the format and library are:
Store spectra for exchange, especially for model compounds. Raw data, direct from the beamline will probably need to be converted to this format.
Store information about the sample, measurement conditions, etc.
Store multiple spectra, either on the same sample or multiple samples, and possibly taken at many facilities.
Provide programming libraries and simple standalone applications that can read, write, and manage such data libraries. Programming libraries would have to support multiple languages.
There are a few reasonable ways to solve this problem. What follows below is a methods which makes heavy use of relational databases and SQL. The principle argument here is that relational databases offer a well-understood, proven way to store data with extensible meta-data. The use of SQL also makes the programming libraries simpler, as they can rely on tested SQL syntax to access the underlying data store.
Development Code and Data¶
As the XAS Data Library is being developed, code and examples will be available at http://github.com/newville/xasdatalib
Why SQLite¶
I propose using SQLite, a widely used, Free relational database engine as the primary store for the XAFS Data Library. A key feature of SQLite is that it needs no external server or configuration – the database is contained in a single disk file. SQLite databases can accessed with a variety of tools 8, 9
Challenges Using SQL Tables for Numerical Array¶
SQL-based relational databases may not be the most obvious choice for storing scientific data composing of arrays of related data. One obvious limitation is that relational databases don’t store array data very well. Thus storing array data in a portable way within the confines of an SQL database needs special attention. The approach adopted here is to JSON, which can encapsulate an array, or other complex data structure into a string.
Using JSON to store array data¶
JSON – Javascript Object Notation, provides a standard, easy-to-use method for encapsulating complex data structures into strings that can be parsed and used by a large number of programming languages as the original data. In this respect, the requirements for the XAS Data Library – numerical arrays of data – are fairly modest. Storing array data in strings is, of course, what ASCII Column Files have done for years, only not with the benefit of a standard programming interface to read them. As an example, an array of data [8000, 8001.0 , 8002.0] would be encoded in json as
[8000, 8001.0, 8002.0]
This is considerably easier and lighter weight than using XML to encode array data.
In addition to encoding numerical arrays, JSON can also encode an associative array (also known as a Hash Table, Dictionary, Record, or Key/Value List. This can be a very useful construct for storing attribute information. It might be tempting to use such Associative Arrays for many pieces of data inside the database, this would prevent those data from being used in SQL SELECT and other statements: such data would not be available for making relations. But, as Associative Arrays can so useful and extensible, several of the tables in the database include a attributes column that is always stored as text. This data will be expected to hold a JSON-encoded Associative Array that may be useful to complement the corresponding notes column. This data cannot be used directly in searching the database, but may be useful to particular applications.
Other Challenges When Using SQLite¶
While robust, powerful and compliant with SQL standards, SQLite does not always provide as rich set of Data Types as some SQL relational databases. In particular for the design here, SQLite does not support Boolean values or Enum fields. Integer Values are used in place of Boolean Values. Enum values (which may have been used to encode Elements, Collection Modes, etc) are implemented as indexes into foreign tables, and JOINs must be used to relate the data in the tables.
Tables and Database Schema¶
The principle data held in a XAFS Data Library is XAFS Spectra. In addition, it is useful to include data on Sample preparation, measurement conditions, and so on. In addition it is useful to be able to combine several spectra into a Suite, and to identify the people adding to the library. Thus the XAFS Data Library contains the following main tables:
Table Name |
Description |
---|---|
spectra |
main XAS spectra, pointers to other table |
sample |
Samples |
crystal_structure |
Crystal structures |
person |
People |
citation |
Literature or Other Citations |
format |
Data Formats |
suite |
Spectra Suites |
facility |
Facilities |
beamline |
Beamlines |
monochromator |
Monochromators |
mode |
Modes of Data Collection |
ligand |
Ligands |
element |
names of Elements |
edge |
names of x-ray Edges |
energy_units |
units for energies stored for a spectra |
While some of these tables (spectra, sample) are fairly complex, many of the tables are really quite simple, holding a few pieces of information.
In addition there are a few Join Tables to tie together information and allow Many-to-One and Many-to-Many relations. These tables include
Table Name |
Description |
---|---|
spectra_mode |
mode(s) used for a particular spectra |
spectra_ligand |
ligand(s) present in a particular spectra |
spectra_suite |
spectra contained in a suite |
spectra_rating |
People’s comments and scores for Spectra |
suite_rating |
People’s comments and scores for Suites |
A key feature of this layout is that a Suite is very light-weight, simply comprising lists of spectra. Multiple suites can contain an individual spectra, and a particular spectra can be contained in multiple suites without repeated data.
The tables are described in more detail below. While many are straightforward, a few tables may need further explanation.
Spectra Table¶
This is the principle table for the entire database, and needs extensive discussion. Several of the thorniest issues have to be dealt with in this table, making this likely to be the place where most attention and discussion should probably be focused.
--
create table spectra (
id integer primary key
name text not null,
notes text,
attributes text,
file_link text,
data_energy text,
data_i0 text default '[1.0]',
data_itrans text default '[1.0]',
data_iemit text default '[1.0]',
data_irefer text default '[1.0]',
data_dtime_corr text default '[1.0]',
calc_mu_trans text default '-log(itrans/i0)',
calc_mu_emit text default '(iemit*dtime_corr/i0)',
calc_mu_refer text default '-log(irefer/itrans)',
notes_i0 text,
notes_itrans text,
notes_iemit text,
notes_irefer text,
temperature text,
submission_date datetime,
collection_date datetime,
reference_used integer,
energy_units_id -- > energy_units table
monochromator_id -- > monochromator table
person_id -- > person table
edge_id -- > edge table
element_z -- > element table
sample_id -- > sample table
beamline_id -- > beamline table
format_id -- > format table
citation_id -- > citatione table
reference_id -- > sample table (for sample used as reference)
We’ll discuss the table entries more by grouping several of them together. First, Each entry in the spectra table contains links to many other tables.
spectra Column Name |
Description |
---|---|
energy_units_id |
index of energy_units table |
person_id |
index of person table for person donating spectra |
edge_id |
index of edge table for X-ray Edge |
element_z |
index of element table for absorbing element |
sample_id |
index of sample table, describing the sample |
reference_id |
index of sample table, describing the reference sample |
beamline_id |
index of the beamline where data was collected |
monochromator_id |
index of the monochromator table for mono used |
format_id |
index of the format table for data format used |
citation_id |
index of the citation table for literature citation |
Next, the table contains ancillary information (you may ask why some of these are explicit while others are allowed to be put in the attributes field).
spectra Column Name |
Description |
---|---|
notes |
any notes on data |
attributes |
JSON-encoded hash table of extra attributes |
temperature |
Sample temperature during measurement |
submission_date |
date of submission |
reference_used |
Boolean (0=False, 1=True) of whether a Reference was used |
file_link |
link to external file |
Here, reference_used means whether data was also measured in the reference channel for additional energy calibration . If 1 (True), the reference sample must be given. The file_link entry would be the file and path name for an external file. This must be relative to the directory containing database file itself, and cannot be an absolute path. It may be possible to include URLs, ….
Finally, we have the information for internally stored data arrays themselves
spectra Column Name |
Description |
Default |
---|---|---|
data_energy |
JSON data for energy |
– |
data_i0 |
JSON data for I_0 (Monitor) |
1.0 |
data_itrans |
JSON data for I_transmission (I_1) |
1.0 |
data_iemit |
JSON data for I_emisssion (fluorescence, electron yield) |
1.0 |
data_irefer |
JSON data for I_trans for reference channel |
1.0 |
data_dtime_corr |
JSON data for Multiplicative Deadtime Correction for I_emit |
1.0 |
calc_mu_trans |
calculation for mu_transmission |
-log(dat_itrans/dat_i0) |
calc_mu_emit |
calculation for mu_emission |
dat_iemit * dat_dtime_corr / dat_i0 |
calc_mu_refer |
calculation for mu_reference |
-log(dat_irefer/dat_itrans) |
calc_energy_ev |
calculation to convert energy to eV |
None |
notes_energy |
notes on energy |
|
notes_i0 |
notes on dat_i0 |
|
notes_itrans |
notes on dat_itrans |
|
notes_iemit |
notes on dat_iemit |
|
notes_irefer |
notes on dat_irefer |
The data_***** entries will be JSON encoded strings of the array data. The calculations will be covered in more detail below. Note that the spectra_mode table below will be used to determine in which modes the data is recorded.
Data Storage¶
As alluded to above, the data_***** will be stored as JSON-encoded strings.
Encoding Calculations, particularly for “Energy to eV”¶
The calculations of mu in the various modes are generally well defined, but it is possible to override them.
Energy Units¶
The calculations of mu in the various modes are generally well defined, but it is possible to override them.
Sample Table¶
-- sample information
create table sample (
id integer primary key,
person_id integer not null, -- > person table
crystal_structure_id integer, -- > crystal_structure table
name text,
formula text,
material_source text,
notes text,
attributes text);
Crystal_Structure Table¶
-- crystal information (example format = CIFS , PDB, atoms.inp)
create table crystal (
id integer primary key ,
format text not null,
data text not null,
notes text,
attributes text);
Ligand Table¶
create table ligand (
id integer primary key,
name text,
notes text);
create table spectra_ligand (
id integer primary key,
ligand integer not null, --> ligand table
spectra integer not null); --> spectra table
Person Table¶
create table person (
id integer primary key ,
email text not null unique,
first_name text not null,
last_name text not null,
sha_password text not null);
Citation Table¶
create table citation (
id integer primary key ,
journal text,
authors text,
title text,
volume text,
pages text,
year text,
notes text,
attributes text,
doi text);
Format Table¶
-- spectra format: table of data formats
--
-- name='internal-json' means data is stored as json data in spectra table
--
create table format (
id integer primary key,
name text,
notes text,
attributes text);
insert into format (name, notes) values ('internal-json', 'Read dat_*** columns of spectra table as json');
Suite Table¶
-- Suite: collection of spectra
create table suite (
id integer primary key ,
person integer not null, -- > person table
name text not null,
notes text,
attributes text);
-- SUITE_SPECTRA: Join table for suite and spectra
create table spectra_suite (
id integer primary key ,
suite integer not null, -- > suite table
spectra integer not null); -- > spectra table
Rating Table¶
A rating is a numerical score given to a Spectra or a Suite of Spectra by a particular person. Each score can also be accompanied by a comment.
While not enforced within the database itself, the scoring convention should be Amazon Scoring: a scale of 1 to 5, with 5 being best.
create table rating (
id integer primary key ,
person integer not null, -- > person table
spectra integer, -- > spectra table
suite integer, -- > suite table
score integer,
comments text);
Monochromator and Collection_Mode Tables¶
These two tables simply list standard monochromator types and data collection modes.
-- Monochromator descriptions
create table monochomator (
id integer primary key,
name text,
lattice_constant text,
steps_per_degree text,
notes text,
attributes text);
-- XAS collection modes ('transmission', 'fluorescence', ...)
create table collection_mode (
id integer primary key,
name text,
notes text);
insert into collection_mode (name, notes) values ('transmission', 'transmission intensity through sample');
insert into collection_mode (name, notes) values ('fluorescence, total yield', 'total x-ray fluorescence intensity, as measured with ion chamber');
insert into collection_mode (name, notes) values ('fluorescence, energy analyzed', 'x-ray fluorescence measured with an energy dispersive (solid-state) detector. These measurements will often need to be corrected for dead-time effects');
insert into collection_mode (name, notes) values ('electron emission', 'emitted electrons from sample');
insert into collection_mode (name, notes) values ('xeol', 'visible or uv light emission');
create table spectra_modes (
id integer primary key ,
mode integer not null, -- > collection_mode
spectra integer not null); -- > spectra table
Beamline and Facility Tables¶
These two tables list X-ray (synchrotron) facilities and particular beamlines.
Note that a monochromator is optional for a beamline.
-- beamline description
-- must have a facility
-- a single, physical beamline can be represented many times for different configurations
create table beamline (
id integer primary key ,
facility integer not null, --> facility table
name text,
xray_source text,
monochromator integer, -- > monochromator table (optional)
notes text,
attributes text);
-- facilities
create table facility (
id integer primary key,
name text not null unique,
notes text,
attributes text);
Element and Edge Tables¶
These two tables simply list standard symbols and names of the elements of the periodic table, and the standard names for the x-ray absorption edges. The schema are
create table element (z integer primary key,
symbol text not null unique,
name text);
insert into element (z, symbol, name) values (1, 'H', 'hydrogen');
insert into element (z, symbol, name) values (2, 'He', 'helium');
create table edge (id integer primary key,
name text not null unique,
level text);
insert into edge (name, level) values ('K', '1s');
insert into edge (name, level) values ('L3', '2p3/2');
insert into edge (name, level) values ('L2', '2p1/2');
insert into edge (name, level) values ('L1', '2s');
Supported Low-Level Data Formats¶
Initially, the principle data format for the XAS Data library will be Internally Stored, JSON-encoded data arrays. Storing data internally has the advantage of preserving the database as a single file. JSON-encoded arrays have the advantage of being readily useful to many languages and environments. Alternate internal formats could be allowed, but no such formats are yet identified.
External data
Example Queries¶
Programming Interface(s)¶
References, External Links¶
Notes¶
Footnotes