Dataset Identification:

Resource Abstract:

description: Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physico-chemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers: chemical name, CASRN, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats and identifiers, as well as various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared to that of models built on the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets and is being made publicly available for further use and integration by the scientific community. This dataset is associated with the following publication: Mansouri, K., C. Grulke, A. Richard, R. Judson, and A. Williams. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modeling. SAR and QSAR in Environmental Research. Taylor & Francis, Inc., Philadelphia, PA, USA, 27(11): 911-937, (2016).

Citation

Title Judson_Mansouri_Automated_Chemical_Curation_QSAREnvRes_Data.
creation Date 2017-10-03T17:18:51.216590

Resource language:

Processing environment:

Resource distribution information

Digital Transfer Options

Metadata Information

Metadata date stamp: 2018-08-07T00:02:58Z

Resource Maintenance Information

maintenance or update frequency:
notes: This metadata record was generated by an XSLT transformation from a DC metadata record. Transform by Stephen M. Richard, based on a transform by Damian Ulbricht. Run on 2018-08-07T00:02:58Z

Metadata contact - pointOfContact

organisation Name CINERGI Metadata catalog

Contact information

Address


electronic Mail Address cinergi@sdsc.edu

Metadata language eng

Metadata character set encoding: utf8

Metadata standard for this record: ISO 19139 Geographic Information - Metadata - Implementation Specification

standard version: 2007

Metadata record identifier: urn:dciso:metadataabout:7a3dab95-33f2-4208-8968-cdbdfe43db5b

Metadata record format is ISO19139 XML (MD_Metadata)