Computational Toxicology Research Program

Chemical Information Quality Review Procedures

DSSTox SDF Data File Construction Flowchart

DSSTox Source-Content Fields

Chemical Structure Identification and Annotation Procedures

Use of DSSTox Master File

Public and Commercial Chemical Structure Information Resources

Quality Review in DSSTox Master File

DSSTox Chemical and Substance IDs

Locating Text Errors in DSSTox Standard Chemical Fields

User-Submitted Error Reports

Note: This page has been updated to include information on use of the DSSTox Master File and new standard fields DSSTox_RID, DSSTox_Generic_SID, and DSSTox_FileID in DSSTox Data File construction. (August 2007)

DSSTox SDF Data File Construction Flowchart:

The process of DSSTox SD File construction begins with a Source Data File, obtained from a public website, a Source Collaborator, or created from a publication, document, or public source of data. A DSSTox Data File NAMEID (e.g., CPDBAS) is assigned, and each Source record is given unique identifiers: DSSTox_RID (referred to as RID in the flowchart below) and DSSTox_FileID (referred to as FileID in the flowchart below). From this point on, Source data content and chemical content are processed separately, then remerged at the end of the process using the common DSSTox_FileID identifiers. The flowchart below illustrates the process used in the generation and assignment of DSSTox Standard Chemical Fields to new Source records. QA procedures outlined on this page pertain to elements of this process.

[Figure: DSSTox File Construction Flowchart, showing details of the 3 main elements of the process: Source-content, Structure-annotation process, and use of DSSTox Master File. The elements of this flowchart are described in the corresponding text sections.]

DSSTox Source-Content Fields:

Source Content refers to the source data component of a new DSSTox Data File, i.e., fields pertaining to Source-specific material not pertaining to the DSSTox Standard Chemical Fields. This content can take many forms, including toxicity data, experimental notes, quantitative or qualitative activity measures, chemical properties, activity predictions, URLs to general websites or chemical-specific data pages, etc. See DSSTox Central Field Definition Table for a full listing of current data fields present in DSSTox Data Files.

Note: Source-provided chemical structures or SMILES are considered potentially unreliable and used only for confirming chemical identification in case of Source-provided CAS and chemical name discrepancies, or for final structure consensus validation.

Chemical Structure Identification and Annotation Procedures:

A Candidate Source-provided data file typically contains a list of chemical names, most often with CAS registry numbers (CASRN), and occasionally with structures or SMILES. After separating the Source-content fields from the chemical-content fields (see Flowchart above), a process is undertaken to interpret the Source-provided chemical information and populate the DSSTox Standard Chemical Fields as accurately as possible. The general procedure is as follows:

Determine that Source-provided "chemical name", or "chemical name + CAS", are sufficient to unambiguously identify generic form of chemical for further annotation:

Enter chemical name and/or CASRN into ChemID (http://chem2.sis.nlm.nih.gov/chemidplus/chemidlite.jsp) or other Public/Commercial resources and check that the CAS number and chemical name agree. As indicated in the Flowchart above, cases where the original Source chemical annotation is insufficient to identify a chemical require additional effort to resolve, e.g.:

chemical name lacks sufficient specificity,
e.g., the name "chlorophenol" does not specify the position of the Cl on the ring relative to the OH

Source CAS and chemical name (or structure) do not agree, calling into question what was actually tested,
e.g., Source-provided CAS is for the parent form whereas Source-provided chemical name specifies Na salt form

If Source-provided CASRN and chemical name do not agree:

The obvious problems are checked first:

Is the CASRN valid based on an automated numerical check (http://www.cas.org/expertise/cascontent/registry/checkdig.html)?
Is the CASRN transposed (e.g. 1234-57-6 rather than 1234-56-7)?
Is chemical name misspelled or arranged in a non-typical manner?
Is chemical name or CASRN referring to a different salt or complexed form of the chemical or a different isomeric form?
Is the CASRN valid but retired or replaced by a new CASRN?
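
The first two checks above lend themselves to automation. The following is a minimal Python sketch, not the DSSTox project's own script. It assumes the standard CAS check-digit rule (the last digit equals the sum of the preceding digits, weighted 1, 2, 3, ... from the right, modulo 10); the function names casrn_is_valid and adjacent_transpositions are hypothetical.

```python
import re
from typing import List

# CASRN layout: 2 to 7 digits, 2 digits, 1 check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def casrn_is_valid(casrn: str) -> bool:
    """Return True if the string is CASRN-shaped and its check digit is correct."""
    match = CAS_PATTERN.match(casrn.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)  # every digit except the check digit
    check = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def adjacent_transpositions(casrn: str) -> List[str]:
    """List well-formed CASRNs obtained by swapping two neighbouring digits."""
    match = CAS_PATTERN.match(casrn.strip())
    if not match:
        return []
    digits = list("".join(match.groups()))
    found = set()
    for i in range(len(digits) - 1):
        if digits[i] == digits[i + 1]:
            continue  # swapping equal digits changes nothing
        swapped = digits.copy()
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        text = "".join(swapped)
        candidate = f"{text[:-3]}-{text[-3:-1]}-{text[-1]}"
        if casrn_is_valid(candidate):
            found.add(candidate)
    return sorted(found)
```

For example, casrn_is_valid("7732-18-5") (water) passes, while the transposed "7732-81-5" fails, and adjacent_transpositions("7732-81-5") offers "7732-18-5" as the only single-swap repair. A passing check digit shows only that the number is well formed; it does not show that the number belongs to the named chemical, and it says nothing about retired or replaced CASRNs.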

If conflict still not resolved:

the Source Collaborator is consulted to resolve the issue;
the Source literature citation is examined for any additional information on that particular compound, e.g., a supplier;
a web search is conducted, with emphasis on literature references that report toxicity data of a type similar to that in the new database under construction.

The search usually allows an educated (organic chemist's) assessment of which is correct, the chemical name or the CASRN. The final chemical name and CASRN are submitted to the Source Collaborator for reconciliation and final resolution of conflict.

If CASRN and Chemical Name cannot be definitively reconciled:

If a "best judgement" identification can be made with some reasonable level of confidence, but with some residual uncertainty, the record is retained for inclusion in the DSSTox Data File and a note to this effect is entered in the corresponding Note_NAMEID field for the file. If a chemical identification cannot be sufficiently resolved in association with a Source record so as to make a reasonable "best judgement" identification after consulting with the Source, the record is rejected for inclusion in the DSSTox Data File.

Use of DSSTox Master File:

Having achieved sufficient chemical identification, the next step is to populate DSSTox Standard Chemical Fields:

If CASRN is available, cross reference to existing DSSTox Master File for matching content:

If match is found:

Use DSSTox Master File to assign existing DSSTox_Generic_SID and DSSTox_CID, and populate all remaining DSSTox Standard Chemical Fields, including STRUCTURE.
TestSubstance_ChemicalName field is the only DSSTox Standard Chemical Field that retains the Source-provided chemical name.

If CASRN is unavailable or no match is found in DSSTox Master File, public and commercial resources are consulted for Structure-assignment to chemical name:

Enter chemical name and/or CASRN into ChemID (http://chem2.sis.nlm.nih.gov/chemidplus/chemidlite.jsp) to retrieve an initial candidate structure.
Use ACD/Labs Name-to-Structure software (v10.0) to generate a candidate structure.
The structure is further checked against public and commercial chemical structure information resources, for a minimum total of 3 confirming structure sources.
When a chemical structure is not found in the primary data resources, or if discrepancies cannot be resolved with the available public resources, a net is cast more broadly by general internet searching.
If discrepancies cannot otherwise be resolved, or questions remain, the last resort is to consult the commercial CAS SciFinder resource as a definitive source of CAS registry numbers matched with chemical name and structure information; this resource must be used sparingly due to its significant cost and limited availability for our use within EPA.
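
The requirement of at least 3 confirming structure sources can be expressed as a simple tally. The sketch below is illustrative only and is not the procedure's actual tooling: it assumes each resource's answer has already been reduced to one comparable string per structure (for instance a canonical identifier produced by a single toolkit), so that equal strings mean equal structures. The function name consensus_structure is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

MIN_CONFIRMING_SOURCES = 3  # minimum stated in the procedure above


def consensus_structure(candidates: Dict[str, str]) -> Tuple[Optional[str], List[str]]:
    """Return (structure, agreeing_sources), or (None, []) if unconfirmed.

    `candidates` maps a resource name to the structure string it returned.
    Assumption: the strings are already in one canonical form.
    """
    ranked = Counter(candidates.values()).most_common(2)
    if not ranked or ranked[0][1] < MIN_CONFIRMING_SOURCES:
        return None, []  # too few sources agree
    if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
        return None, []  # two structures equally supported: needs expert review
    structure = ranked[0][0]
    agreeing = sorted(name for name, s in candidates.items() if s == structure)
    return structure, agreeing
```

A result of (None, []) corresponds to the cases above where the search is widened or SciFinder is consulted. Any dissenting source still deserves a look by a chemist even when three others agree, since a tally cannot tell which source is right.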

If CASRN is unavailable and Structure has been assigned, chemical must be incorporated into DSSTox Master File:

Once a structure is assigned and validated by multiple sources and expert organic chemist review, it is cross referenced against the DSSTox Master File to locate matching existing chemical records, for which DSSTox Standard Chemical Fields are already available.

If match is found:

Use DSSTox Master File to assign existing DSSTox_Generic_SID and DSSTox_CID, and populate all remaining DSSTox Standard Chemical Fields, including STRUCTURE.
TestSubstance_ChemicalName field is the only DSSTox Standard Chemical Field that retains the Source-provided chemical name.

If match is not found:

a new DSSTox_Generic_SID and DSSTox_CID are generated and the remaining Standard Chemical Fields are populated using automated procedures detailed elsewhere (see More on DSSTox Standard Chemical Fields).
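
The match / no-match branches above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Master File's real logic: the Master File is a set of Microsoft Access tables, whereas here it is a plain Python list; IDs are small integers rather than real DSSTox identifiers; structures are compared as exact strings; and the function name assign_ids and the dictionary keys are hypothetical. How the real process chooses among several substances that share one structure is not described on this page, so the sketch simply takes the first match.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def assign_ids(master: List[Record], casrn: Optional[str], structure: str) -> Tuple[int, int, bool]:
    """Return (generic_sid, cid, is_new) for one chemical record."""
    # 1. Look for an existing substance: by CASRN if given, else by structure.
    for rec in master:
        if casrn is not None:
            matched = rec["casrn"] == casrn
        else:
            matched = rec["structure"] == structure
        if matched:
            return rec["sid"], rec["cid"], False

    # 2. No match: a new substance. Reuse the CID if the structure is already
    #    indexed (assumption: one CID per distinct structure).
    cid = next((rec["cid"] for rec in master if rec["structure"] == structure), None)
    if cid is None:
        cid = 1 + max((rec["cid"] for rec in master), default=0)
    sid = 1 + max((rec["sid"] for rec in master), default=0)
    master.append({"sid": sid, "cid": cid, "casrn": casrn, "structure": structure})
    return sid, cid, True
```

Starting from an empty list, the first call for a chemical creates a record and returns is_new as True; a repeat call with the same CASRN returns the same pair of IDs with is_new as False.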

Once all Source chemical records have been populated with DSSTox Chemical Standard Fields:

All Source NAMEID records are extracted from the DSSTox Master File. The DSSTox_FileID identifiers (denoted FileID in the above Flowchart), which were originally assigned to both sets of fields, are used to recombine these records with the processed Source-Content fields. This merged set of DSSTox Standard Chemical Fields + Source Content fields is then used to construct the final DSSTox Data File for publication.
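
The recombination step amounts to a one-to-one join on DSSTox_FileID. The following minimal sketch assumes each set of fields is held as a list of dictionaries; the function name merge_on_file_id is hypothetical, and any source-content field names or FileID values used with it are placeholders rather than real DSSTox content.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def merge_on_file_id(chemical_rows: List[Row], source_rows: List[Row]) -> Tuple[List[Row], List[Any]]:
    """Join chemical-field rows to source-content rows on DSSTox_FileID.

    Returns (merged_rows, unmatched_file_ids). Unmatched IDs are reported
    rather than silently dropped so that they can be reviewed.
    """
    source_by_id = {row["DSSTox_FileID"]: row for row in source_rows}
    if len(source_by_id) != len(source_rows):
        raise ValueError("duplicate DSSTox_FileID in source-content rows")

    merged, unmatched = [], []
    for chem in chemical_rows:
        file_id = chem["DSSTox_FileID"]
        source = source_by_id.pop(file_id, None)
        if source is None:
            unmatched.append(file_id)  # chemical row without source content
            continue
        merged.append({**chem, **source})
    unmatched.extend(source_by_id)  # source rows without a chemical row
    return merged, unmatched
```

An empty unmatched list is a quick sanity check that no record was lost while the two sets of fields were processed separately.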

Public and Commercial Chemical Structure Information Resources:

Listed below are the main commercial and public chemical data resources used in DSSTox file creation and quality review.
Disclaimer: This list is not meant to be exhaustive, nor does mention of trade names imply endorsement by us or the EPA.

Public on-line resources:

National Library of Medicine (NLM) ChemID Plus: http://chem2.sis.nlm.nih.gov/chemidplus/chemidlite.jsp

CambridgeSoft's ChemFinder.com Database & Internet Searching: http://chemfinder.cambridgesoft.com/

International Programme on Chemical Safety: http://www.intox.org/databank/index.htm

National Institute of Standards and Technology (NIST) Chemistry Webbook: http://webbook.nist.gov/chemistry/

Landolt Börnstein Online Index: http://lb.chemie.uni-hamburg.de/

Enhanced National Cancer Institute Database Browser: http://129.43.27.140/ncidb2/

Sigma-Aldrich: http://www.sigmaaldrich.com/Local/SA_Splash.html

For pesticides, Compendium of Pesticide Common Names: http://www.alanwood.net/pesticides/

ChemSpider, Structure Search Database of >16 million chemical structures: http://www.chemspider.com

NCBI PubChem - Database of chemical structures and bioactivity of small molecules: http://pubchem.ncbi.nlm.nih.gov/

Commercial resources (stand-alone products or subscription services):

ACD/Labs Dictionary (v. 10.0), offered in the commercial version of ACD/Labs ChemSketch at: http://www.acdlabs.com/products/chem_dsn_lab/chemsketch/dictionary/

ACD/Labs Name-to-Structure Software (v. 10.0): http://www.acdlabs.com/products/name_lab/name/

The Merck Index, Thirteenth Ed. (distributed by CambridgeSoft): http://www.cambridgesoft.com/databases/details/?db=1

CAS SciFinder: http://www.cas.org/products/scifindr/index.html

ChemACX listing of available commercial chemicals, distributed by CambridgeSoft's ChemOffice: http://www.cambridgesoft.com/databases/details/?db=12

Details of SDF construction and quality assurance procedures are provided for each DSSTox Structure Data File in the NAMEID_LogFile (where NAMEID = DBPCAN, EPAFHM, etc.) provided on the corresponding DSSTox SDF Download Page (see, e.g., EPAFHM). In addition, we encourage User-submitted Error Reports.

Quality Review of New Data in DSSTox Master File:

The central DSSTox Master File is a consolidated series of Microsoft Access tables for managing DSSTox Standard Chemical Field information, spanning all DSSTox Structure Data Files (SDF files both published and in development), and is designed to ensure consistent chemical information across the entire DSSTox chemical inventory. All quality assurance (QA) of DSSTox Standard Chemical Fields to be used for new DSSTox SDF files initially takes place within the DSSTox Master File. As part of the Chemical Structure Identification and Annotation Procedures outlined above, these QA steps include the following:

Assignment of a unique DSSTox_RID to each new data file entry;

Identification of duplicate CASRN and structure entries, and proper assignment of DSSTox_Generic_SID and DSSTox_CID;

Checks of DSSTox Standard Chemical Field text entries conforming to allowable field entries for all new additions to the DSSTox Master file.
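
The duplicate-identification step can be approximated by grouping records on a key and flagging any key that carries more than one ID. This is a sketch under assumptions, not the Access queries actually used: records are dictionaries, the key name "CASRN" stands in for whichever field holds the registry number, structures are compared as exact strings, and the function name inconsistent_ids is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def inconsistent_ids(records: List[Dict[str, Any]], key_field: str, id_field: str) -> Dict[Any, List[Any]]:
    """Map each key value to its IDs wherever one key carries several IDs."""
    ids_by_key = defaultdict(set)
    for rec in records:
        key = rec.get(key_field)
        if key:  # skip records where the key is missing or empty
            ids_by_key[key].add(rec[id_field])
    return {key: sorted(ids) for key, ids in ids_by_key.items() if len(ids) > 1}
```

Calling inconsistent_ids(records, "STRUCTURE", "DSSTox_CID") lists structures that have been given more than one DSSTox_CID, and inconsistent_ids(records, "CASRN", "DSSTox_Generic_SID") lists registry numbers paired with more than one DSSTox_Generic_SID. Anything returned is a candidate for manual review rather than a confirmed error.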

DSSTox Chemical and Substance IDs:

Chemical and Substance IDs are essential features of the DSSTox Master File; they are the primary means of internal referencing and of ensuring consistency across chemical structures and test substances:

Index each Structure with a unique Chemical ID (DSSTox_CID):

All chemical structures and structure-related fields (STRUCTURE and STRUCTURE_... content-linked fields) in the DSSTox Master File are indexed with a unique Chemical ID (DSSTox_CID). A DSSTox_CID will be the same for two DSSTox data records only if the contents of the STRUCTURE and STRUCTURE_... content-linked fields are identical (see More on DSSTox Standard Chemical Fields).

Different substances sharing the same DSSTox_CID (i.e., unique chemical structure IDs) will generally be associated with different toxicity test data. For example, a test result for a pure chemical substance will be considered distinct from a test result for a different purity grade, or where this chemical is a component of a mixture (i.e., same DSSTox_CID, different DSSTox_Generic_SID).

Index each Test Substance with a Generic Substance ID (DSSTox_Generic_SID):

A DSSTox_Generic_SID will be the same for two DSSTox data records only if the TestSubstance_... content-linked fields refer to the same generic tested substance to the best of our determination (see More on DSSTox Standard Chemical Fields). In this case, the information stored in the DSSTox Master File will be identical for the TestSubstance_... content-linked fields; only the particulars of the TestSubstance_ChemicalName field may vary from one DSSTox data file to another, depending on the particular chemical name used by the Source collaborator, which we strive to match for ease of referencing to the Source database.

The ChemicalNote field is used to record any Source-independent information on mixture characterization, whereas Source-dependent discrepancies between DSSTox data files in TestSubstance_... annotation (e.g., chemical names or CASRN) are recorded in the Source-specific Note_NAMEID field.

Since we include representative structures of single compounds for many "mixture or formulation" records in the DSSTox Master File, in many cases a single DSSTox_CID will be paired with multiple DSSTox_Generic_SIDs. Hence, there are more unique SIDs than CIDs. DSSTox_Generic_SID values are included in all published DSSTox data files to facilitate location of duplicate "TestSubstance_..." records within or across DSSTox data files.
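
The two relationships described in this section (one DSSTox_CID paired with several DSSTox_Generic_SIDs, and one DSSTox_Generic_SID recurring across records) can be illustrated with a short grouping sketch. It assumes records pooled from one or more data files are available as dictionaries keyed by the standard field names; the function names are hypothetical and the ID values used with them are placeholders.

```python
from collections import defaultdict
from typing import Any, Dict, List


def sids_per_cid(records: List[Dict[str, Any]]) -> Dict[Any, List[Any]]:
    """Map each DSSTox_CID to the sorted DSSTox_Generic_SIDs paired with it."""
    by_cid = defaultdict(set)
    for rec in records:
        by_cid[rec["DSSTox_CID"]].add(rec["DSSTox_Generic_SID"])
    return {cid: sorted(sids) for cid, sids in by_cid.items()}


def duplicate_substances(records: List[Dict[str, Any]]) -> Dict[Any, List[Any]]:
    """Map each DSSTox_Generic_SID seen more than once to its DSSTox_RIDs."""
    by_sid = defaultdict(list)
    for rec in records:
        by_sid[rec["DSSTox_Generic_SID"]].append(rec["DSSTox_RID"])
    return {sid: rids for sid, rids in by_sid.items() if len(rids) > 1}
```

A DSSTox_CID whose list holds more than one SID is the "same structure, different test substance" case; a SID returned by duplicate_substances marks records that refer to the same generic test substance.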

Locating Text Errors in DSSTox Standard Chemical Fields:

Historical construction of DSSTox Data Files involved a large measure of human data entry of text information, either directly (DSSTox project) or indirectly (Source-provided), with the inevitable human errors that non-automated data entry entails. Data field entry counts, consolidation of the DSSTox Standard Chemical Field portion of all DSSTox Data Files into the DSSTox Master Structure-Index File, enforcement of consistency of common information across all files, and manual checks and QA reviews have detected and corrected the vast majority of these errors. More recently, however, we have employed programming scripts to perform automated checks on DSSTox Standard Chemical Field content, both past and new, requiring the text content to conform to the allowed field entry values and basic ASCII text standards. These automated text field scans have located numerous minor typographical errors and some non-standard ASCII characters from imported text field entries. Once located, these errors or inadvertent characters have been corrected. Instituting limits on DSSTox Standard Chemical Field data field entries (predefined menu choices), and performing automated text field scans on new data entry records before finalizing their entry into the DSSTox Master Structure-Index File, will prevent such errors from occurring in the future.
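
An automated text field scan of the kind described above might look like the following sketch. It is not the project's actual script: "basic ASCII text standards" is interpreted here as printable ASCII (character codes 32 to 126), the function name scan_text_fields is hypothetical, and the allowed-value lists passed to it are supplied by the caller rather than taken from the real DSSTox field definitions.

```python
from typing import Dict, List, Set, Tuple


def scan_text_fields(record: Dict[str, str], allowed: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (field_name, problem) pairs found in one record's text fields.

    `allowed` maps a field name to its permitted entries; fields that are
    absent from `allowed` are treated as free text.
    """
    problems = []
    for name, value in record.items():
        # Flag anything outside printable ASCII, including control characters.
        if any(not (32 <= ord(ch) <= 126) for ch in value):
            problems.append((name, "non-ASCII or control character"))
        # Flag entries that are not one of the predefined menu choices.
        if name in allowed and value not in allowed[name]:
            problems.append((name, "value not in allowed list"))
    return problems
```

Running the scan on new records before they enter the Master File matches the preventive use described above; an empty result means the record passed both checks.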

User-Submitted Error Reports:

On each DSSTox SDF Download Page (see, e.g., EPAFHM), we provide a link to a File Error Report form for users to report any errors found in DSSTox database or Structure-Index files. Alternatively, a user can Contact Us with the appropriate error reporting information. Reported errors can pertain to any aspect of a DSSTox database publication, but primarily we are interested in correcting errors in DSSTox Standard Chemical Fields or toxicity-related fields. We have received several such reports and, after review and confirmation of the new information, wherever possible and deemed correct, we document the error in the corresponding database DSSTox Log File and incorporate the correction into an updated DSSTox database version. This mechanism serves our effort as well as the larger DSSTox user community by providing a central forum for error-reporting and correction.
