Dataset Identification:

Resource Abstract:
description: Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physico-chemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers: chemical name, CASRN, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats and identifiers, as well as various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest quality subset of the original dataset was compared to that of models built on the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further use and integration by the scientific community. This dataset is associated with the following publication: Mansouri, K., C. Grulke, A. Richard, R. Judson, and A. Williams. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modeling. SAR AND QSAR IN ENVIRONMENTAL RESEARCH. Taylor & Francis, Inc., Philadelphia, PA, USA, 27(11): 911-937, (2016).
Citation
Title Judson_Mansouri_Automated_Chemical_Curation_QSAREnvRes_Data.
creation Date  2017-10-03T17:18:51.216590
Resource language:
Processing environment:
Digital Transfer Options
Metadata date stamp:  2018-08-07T00:02:58Z
Resource Maintenance Information
maintenance or update frequency:
notes: This metadata record was generated by an XSLT transformation from a dc metadata record; transform by Stephen M. Richard, based on a transform by Damian Ulbricht. Run on 2018-08-07T00:02:58Z
Metadata contact - pointOfContact
organisation Name  CINERGI Metadata catalog
Contact information
Address
electronic Mail Address  cinergi@sdsc.edu
Metadata language  eng
Metadata character set encoding:   utf8
Metadata standard for this record:  ISO 19139 Geographic Information - Metadata - Implementation Specification
standard version:  2007
Metadata record identifier:  urn:dciso:metadataabout:7a3dab95-33f2-4208-8968-cdbdfe43db5b

Metadata record format is ISO19139 XML (MD_Metadata)