Computational Toxicology Research Program

SDF Download Page

GEOGSE: National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) Series Experiments
Structure-Index Locator File

** Updated Version 2a DSSTox Structure-Index Locator File, 09 March 2009 (Source website content extracted 02Feb2009)

For additional information on DSSTox SDF (Structure Data Format) files and their use in Chemical Relational Databases, see More on SDF and More on CRDs.

Description: The National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) Series is a public user-depositor data repository for the public use and dissemination of gene expression data generated by high-throughput methodologies. GEO supports use of MIAME guidelines in accordance with the Microarray Gene Expression Data Society (MGED) recommendations. Each GEO Series defines a set of related samples considered to be part of a single experiment, and these Series are the focus of the present chemical indexing efforts [Note that GEO Series also most closely correspond to the ArrayExpress Repository experiments in the related DSSTox file, ARYEXP]. GEO currently contains over 9900 Series Accession entries. In GEO, raw and/or processed data can be exported through the ftp site as well as through the main GEO Series website. User information, however, is entered using a free-text format that is subsequently curated. GEO allows for a wide range of informed queries with the Preview/Index window, where users can select data based on choices for each attribute of the experiment. GEO is not chemically indexed, nor does it consistently contain information about the chemical tested.

NCBI GEO and the European Bioinformatics Institute (EBI) ArrayExpress Repository (see accompanying DSSTox file, ARYEXP) are the two main public repositories of gene expression data and microarray experiments associated with the scientific literature. Deposition of data into one of these two resources is now a precondition and standard requirement for journal publication of microarray studies. At the time of this writing, neither resource has standard requirements for reporting of chemical information associated with submitter-deposited experiments. As a result, until now it has been difficult to assess the chemical-related content or, more specifically, the chemical exposure-related content in these resources, such that microarray experiments have been isolated from other public sources of chemically-indexed information pertaining to toxicology. This DSSTox project was undertaken to use chemical information linkages to contribute to building a public toxicogenomics capability and to encourage the application of structure-activity relationship (SAR) concepts to gene expression data where sufficient comparable experiments on chemical analogs are available.

The DSSTox GEOGSE data file is a chemical-index file of unique chemical substances pertaining to the chemical exposure-related experimental content (identified by us as Chemical_StudyType = "Treatment") within the GEO Series Repository as of the date of data extraction (see Note). The chemical exposure-related content of the GEO Series Repository was identified through a series of automated methods using NCBI Entrez Programming Utilities (E-Utilities), an XML version of the chemical Medical Subject Headings (MeSH) library, and a series of custom Perl scripts to parse through a complete XML version of the GEO Series database. These automated methods, however, were insufficient and had to be supplemented by extensive manual curation and review of the chemical content extracted from GEO Series fields and free text description submitter entries (Williams-Devane et al. 2009). The final DSSTox GEOGSE file contains the full complement of DSSTox Standard Chemical Fields for each unique substance, as well as URL link(s) to one or more chemical-specific Experiment_Accession number data page(s) within the GEO Series Repository. All GEO Series Experiment Accession numbers pertaining to the same chemical substance (i.e., the same DSSTox_Generic_SID) are listed in the Experiment_Accession field in the same GEOGSE chemical record.
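The one-record-per-substance layout described above can be illustrated with a minimal Python sketch. It is not the project's actual tooling (the source mentions custom Perl scripts), and several details are assumptions: the URL pattern for a GEO Series accession page, the in-memory record shape, and the numeric SIDs and most accession numbers, which are hypothetical. GSE9357 is the example accession given on this page.

```python
from collections import defaultdict

# Assumed URL pattern for a GEO Series accession page (not stated in the source).
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"


def index_by_substance(pairs):
    """Collapse (DSSTox_Generic_SID, accession) pairs into one record per substance."""
    grouped = defaultdict(list)
    for sid, acc in pairs:
        if acc not in grouped[sid]:  # skip exact duplicate pairs
            grouped[sid].append(acc)
    return {
        sid: {
            "Experiment_Accession": accs,
            "urls": [GEO_ACC_URL.format(acc=a) for a in accs],
        }
        for sid, accs in grouped.items()
    }


# Hypothetical SIDs and accessions, used only to exercise the function.
pairs = [(101, "GSE9357"), (202, "GSE0001"), (101, "GSE0002"), (101, "GSE9357")]
index = index_by_substance(pairs)
print(index[101]["Experiment_Accession"])
```

Keeping accessions in a list per DSSTox_Generic_SID mirrors the statement that all Experiment Accession numbers for a substance sit in a single chemical record; how the real file delimits multiple accessions within the field is not specified here.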

The DSSTox GEOGSE chemical index file has been incorporated into the DSSTox Structure-Browser, and deposited into PubChem, enabling a user to locate particular chemical-associated experiments or those associated with close chemical analogs through a structure similarity search.

GEOGSE Auxiliary Data File: During the course of this project, a large amount of chemical-associated information was initially curated from the full GEO Series Repository that is of potential use for toxicogenomics investigations. Prior to identifying chemical exposure-related GEO content (i.e., Treatment vs. other uses, such as Reference, Vehicle, Media, etc.), we created a full listing of GEO Series Repository chemical-experiment pairs (i.e., one record per Series Accession number, with some DSSTox_Generic_SID substances spanning multiple records and experiments), along with a full complement of summary experimental descriptors and indices provided by GEO. These summary experimental fields include species, array type, number of samples, etc. This content is contained in the Auxiliary Data File (GEOGSE_Aux) offered in the Download Table below in SD or table format. The file contains the full complement of DSSTox Standard Chemical Fields, 14 Standard Genomics Fields (for more information see Main Citation) and 4 Source-specific content fields from GEO experiment annotations (an MS Word doc file listing fields and their definitions is also included in the Download Table below). The content of these files will be incorporated, along with the ARYEXP files, into the Chemical Effects in Biological Systems (CEBS) database and are being provided to the NCBI GEO project in the hopes of improving chemical annotation and data linkages of public gene expression resources in the future.
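As a rough illustration of how the auxiliary chemical-experiment pairs relate to the treatment-only index, the sketch below filters pair records by study type and counts the distinct substances that remain. It assumes the pairs have already been loaded as dictionaries and that the study type is carried in a column named Chemical_StudyType, as the text suggests; all record values are hypothetical.

```python
from collections import Counter


def treatment_subset(aux_records):
    """Keep only pairs in which the chemical was the experimental treatment."""
    return [r for r in aux_records if r.get("Chemical_StudyType") == "Treatment"]


# Hypothetical chemical-experiment pairs, one record per pair.
aux_records = [
    {"DSSTox_Generic_SID": 101, "Experiment_Accession": "GSE0001", "Chemical_StudyType": "Treatment"},
    {"DSSTox_Generic_SID": 101, "Experiment_Accession": "GSE0002", "Chemical_StudyType": "Vehicle"},
    {"DSSTox_Generic_SID": 202, "Experiment_Accession": "GSE0001", "Chemical_StudyType": "Treatment"},
    {"DSSTox_Generic_SID": 303, "Experiment_Accession": "GSE0003", "Chemical_StudyType": "Media"},
]

treated = treatment_subset(aux_records)
unique_treated_sids = sorted({r["DSSTox_Generic_SID"] for r in treated})
study_type_counts = Counter(r["Chemical_StudyType"] for r in aux_records)

print(len(treated), unique_treated_sids, dict(study_type_counts))
```

Note that one accession (GSE0001 in this toy data) can appear under more than one substance, which is why pair counts and unique-experiment counts differ.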

Source Website: NCBI GEO is located online at http://www.ncbi.nlm.nih.gov/geo/.

Note: The NCBI GEO Series repository content is regularly updated; the DSSTox GEOGSE content represents a snapshot of the chemical exposure-related content of that repository extracted on 02Feb2009. We will pursue options for further updates and post future plans on this webpage.


Source Contact: Contact NCBI GEO staff at geo@ncbi.nlm.nih.gov.

Main Citations: For more information on this project and procedures used to extract data and chemically annotate gene expression experiments in the two main public repositories, ArrayExpress and GEO, see:

Williams-Devane, C.R., M.A. Wolf, and A.M. Richard (2009) DSSTox Chemical-index Files for Exposure-Related Experiments in ArrayExpress and Gene Expression Omnibus: Enabling Toxico-chemogenomics Data Linkages, Bioinformatics, 25:692-694.


Williams-Devane, C.R., M.A. Wolf, and A.M. Richard (2009) Towards a public toxicogenomics capability for supporting predictive toxicology: Survey of current resources and chemical indexing of experiments in GEO and ArrayExpress, Toxicological Sciences, in press.

Guidance for Use: GEOGSE represents a departure from previously published DSSTox data files, which either contain toxicology data of potential use for structure-activity relationship (SAR) modeling, or are high-interest chemical inventories for environmental toxicology from the EPA or National Toxicology Program. GEOGSE and ARYEXP are the first DSSTox files to chemically index public repositories of microarray experiments of potential use for toxicogenomics investigations. The DSSTox GEOGSE file is an inventory of unique chemical substances, with each chemical mapped to one or more experiments contained within the GEO Series Repository; in each case, chemical exposure (or treatment) is deemed a primary objective of the experiment. The file was created to encourage consideration of chemical structure and chemical similarity as an organizing principle for such data, to aid in association of common gene expression patterns, and to aid in the aggregation of multiple data types for potential toxicogenomics investigation. Users should be aware that the chemically indexed experimental content of the public GEO Series Repository spans a large diversity of treatment conditions, species, array types, data annotation, laboratories, etc. Hence, data aggregation by chemical or chemical similarity must also consider and attempt to control for these many variables in a public repository. An auxiliary data file, GEOGSE_Aux, is offered for download that includes a larger set of chemical-experiment pairs (including all categories of chemical experiment association) for the GEO Series Repository and 4 additional data fields.

GEOGSE SDF Fields (26 total):

DSSTox Standard Chemical Fields (20)


Source_ChemicalName new field added Feb2009


e.g. Acrylamide: GSE9357

Version 2 Update: GEOGSE_v2a and GEOGSE_Aux_v2a contain updated content extracted from the GEO website as of 02Feb2009 (v1a corresponded to data extraction on 20Sep2008). The method of data extraction and file construction is documented in the Main Citations. A total of 319 new chemical-experiment pairs were identified and were included in the updated GEOGSE_Aux_v2a file. Of these, 308 were labeled by us as chemical "treatment" experiments, and these new experiments correspond to 65 new unique chemicals associated with "treatment" experiments. Hence, 308 new chemical treatment experiment links are provided in PubChem (for a total of 2442 PubChem chemical-experiment pair entries), 65 new chemicals with links to one or more experiments have been added to the GEOGSE_v2a structure-index file, and a total of 308 new URLs to experiments were added. The chemical content totals for GEOGSE_v1a and v2a are summarized in the table below.

Whereas in v1a, the field TestSubstance_ChemicalName was used to store the Source-provided chemical name obtained from the ArrayExpress experimental record (with all abbreviations and sometimes errors), in v2a (and in all DSSTox files posted after Jan09), this Source-provided chemical name has been moved to a new field Source_ChemicalName. The Standard Chemical Field, TestSubstance_ChemicalName, now carries a default, quality-reviewed chemical name used for this substance (DSSTox_Generic_SID) across all DSSTox files (this can be a common, generic or trade name). Structure_InChI and Structure_InChIKey codes have been updated to correspond to the newly published NIST recommended standard InChI options (see http://www.epa.gov/ncct/dsstox/MoreonInChI.html#InChIDSSTox).  

For more information and version history, and to locate specific updated chemical records, consult the GEOGSE_LogFile in the Download Table below and version update entries in the Note_GEOGSE field.


GEOGSE SDF Content Summary - 09 March 2009

Totals_v1a Totals_v2a
# Unique Chemical Records
DSSTox Standard Chemical Fields
DSSTox Standard Toxicity Fields
GEOGSE Source Fields
Total # Fields
Total # Treatment Experiment Accession IDs*
Chemical Content
Counts_v1a Counts_v2a
defined organic
no structure
salt complex
single chemical compound
mixture or formulation

* Note:  Total includes replicate Experiment Accession IDs and corresponds to unique chemical-experiment pairs, which includes many cases where the same Experiment Accession ID is mapped to different unique chemicals (i.e., experiment/study tested many chemicals).

File Download Notes: The following files are offered in the Download Table below:

Structure Data File (SDF) is the main DSSTox product, providing the complete inventory of chemical structures, DSSTox Standard Chemical Fields, and all Source-specific data fields [Note: the structure field is blank for all records containing mixtures or undefined substances];
Data Table MS Excel (MS Office 2003) file contains the full SDF data contents in spreadsheet table form, minus the chemical structure field [file created with CambridgeSoft ChemFinder plug-in to MS Excel 2003];
Structures Table (PDF) file contains a tiled format graphical view of all chemical structures contained in the SDF file, annotated with TestSubstance_CASRN and truncated TestSubstance_ChemicalName field entries for the tested form of the chemical [file created with ACD ChemFolder, ver. 11.00, ACD Labs].
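For readers who want the data fields of an SDF download without a cheminformatics toolkit, a minimal Python sketch follows. It relies only on the general SDF layout (data items introduced by a `> <FieldName>` header line, values terminated by a blank line, records terminated by `$$$$`) and ignores the structure block entirely. The sample record is hypothetical apart from the field names and the Acrylamide/GSE9357 example taken from this page; the exact formatting of field values in the real file is an assumption.

```python
import re

# Data item header lines look like "> <FieldName>" in SDF files.
FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def parse_sdf_fields(text):
    """Return a list with one dict of data fields per SDF record (structure ignored)."""
    records = []
    fields, name, value = {}, None, []
    for line in text.splitlines():
        if line.strip() == "$$$$":  # end of record
            if name is not None:
                fields[name] = "\n".join(value)
            records.append(fields)
            fields, name, value = {}, None, []
            continue
        match = FIELD_HEADER.match(line)
        if match:  # start of a new data item
            if name is not None:
                fields[name] = "\n".join(value)
            name, value = match.group(1), []
        elif name is not None:
            if line.strip() == "":  # blank line closes the current data item
                fields[name] = "\n".join(value)
                name = None
            else:
                value.append(line)
    return records


# Hypothetical record with an empty structure block, for illustration only.
sample = """\
Acrylamide
  hypothetical header line

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <TestSubstance_ChemicalName>
Acrylamide

> <Experiment_Accession>
GSE9357

$$$$
"""

records = parse_sdf_fields(sample)
print(records[0]["Experiment_Accession"])
```

Because the parser skips everything before the first data item header, it handles records with a blank structure field (mixtures or undefined substances) the same way as records with a full structure.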

File Types   Description   File Size   Format

Documentation & Data Files: GEOGSE
Log File   PDF
SDF Structure Data File   1.5 MB   SDF
Data Table (no structures)   Excel (xls)
PDF document
Documentation & Data Files: GEOGSE_Aux
SDF Structure Data File   5.1 MB   Included in Zip file.
Data Table (no structures)   Included in Zip file.
PDF document
Field Definitions   46 KB
These files constitute the main DSSTox products. DSSTox Structure Data Files and DSSTox File Names adhere to strict formatting standards and conventions.

Acknowledgements: All QA review, corrections to submitter chemical information, and structure annotation were carried out by Maritja Wolf (Lockheed Martin, Contractor for EPA). We thank Jennifer Fostel (NIEHS CEBS) and Chihae Yang (Ohio State University) for their helpful comments in the review of this work. We also thank Tom Transue (Lockheed Martin, Contractor for EPA) for assistance with loading of GEOGSE into the DSSTox Structure-Browser and QA review assistance. Updated files were created by ClarLynda Williams and Maritja Wolf.

DSSTox Citation:
Williams-Devane, C.R., M.A. Wolf, and A.M. Richard (2009) DSSTox National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) Series Experiments (GEOGSE and GEOGSE_Aux): SDF Files and Documentation, Updated versions: GEOGSE_v2a_1179_09Mar2009, GEOGSE_Aux_v2a_2700_09Mar2009, www.epa.gov/ncct/dsstox/sdf_geogse.html

Disclaimer: Every effort is made to ensure that DSSTox SDF files and associated documentation are error-free, but neither the DSSTox Source collaborators nor the EPA DSSTox project team make guarantees of accuracy, nor are any of these persons to be held liable for any subsequent use of these public data. The contents of this webpage and supporting documents have been subjected to review by the EPA National Center for Computational Toxicology and approved for publication. Approval does not signify that the contents reflect the views of the Agency, nor does mention of trade names or commercial products constitute endorsement or recommendation for use. See additional disclaimers.


