CUE

CUE: Customizable Unbiased dataset for the Evaluation of protein structure based computational methods

Server Status

Waiting jobs: 0
Estimated remaining time: 0h
Server status: Offline

Help

  • About the dataset
  • Customize your dataset
  • Privacy policy

About the dataset

  1. Aim of our dataset

    Recent studies report that machine learning models for structure-based binding prediction outperform traditional methods such as docking. However, the datasets used to evaluate these models were not designed for machine learning, which allows the models to pick up hidden biases in the dataset and leads to overestimated performance.

    Our dataset was built on the assumption that, for machine learning models to learn target-ligand interactions rather than hidden biases, we need to include more targets and unbias the ligand distribution in structural space.

    CUE aims to provide a benchmark dataset that can correctly evaluate machine learning methods for structure-based binding prediction.

  2. File tree of the dataset

    We provide the target structures as PDB files, together with the active and inactive ligands (activity measured as IC50) in SMILES.

    ./...
    	metadata.csv
    	/data
    		/CHEMBLXXX(target chembl id)
    			/active.sdf
    			/inactive.sdf
    			/PXXXX(target uniprot id)
    				/xxxx.pdb
    				/xxxx.pdb
    				.
    				.
    				.
    		.
    		.
    		.
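
As an illustration of how the layout above might be traversed, here is a minimal Python sketch. It assumes only what the tree shows: a `data` directory with one sub-directory per target, each holding `active.sdf`, `inactive.sdf`, and one or more sub-directories of `.pdb` files. The function name `index_dataset` and the shape of the returned dictionary are hypothetical and not part of CUE.

```python
from pathlib import Path


def index_dataset(root):
    """Map each target directory name to its ligand files and PDB files.

    Hypothetical helper: it only relies on the directory layout shown in
    the file tree and does not read metadata.csv, whose columns are not
    documented here.
    """
    index = {}
    data_dir = Path(root) / "data"
    for target_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        # Sub-directories of a target are assumed to be named after
        # UniProt IDs and to contain the PDB files for that protein.
        pdbs = {
            sub.name: sorted(f.name for f in sub.glob("*.pdb"))
            for sub in sorted(target_dir.iterdir())
            if sub.is_dir()
        }
        index[target_dir.name] = {
            "active": target_dir / "active.sdf",
            "inactive": target_dir / "inactive.sdf",
            "pdbs": pdbs,
        }
    return index
```

Calling `index_dataset(".")` from the directory that contains `metadata.csv` would then return one entry per target directory under `data`.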
    

Customize your dataset

  1. How to use

    Please enter the dataset parameters and your email address.

    A download link for the prepared dataset will be sent to that email address when the extraction is finished.

    The estimated extraction time is 2 hours per job.

  2. Description of the parameters
    • Active ligands IC50 (µM): range from 0.10 to 50.00

      The IC50 threshold used to classify a ligand as active.

    • Inactive ligands IC50 (µM): range from 10.00 to 200.00

      The IC50 threshold used to classify a ligand as inactive.

    • Frequent hitters count: greater than 3

      If a ligand is counted as an active ligand for more than N targets, it is treated as a frequent hitter and removed.

    • Active ligand count per target: greater than 1

      Each target in the dataset must have more than N active ligands.

    • Inactive ligand count per target: greater than 1

      Each target in the dataset must have more than N inactive ligands.

    • PDB resolution (Å): greater than 1.0

      Each target in the dataset must have more than one PDB structure whose resolution value is lower than N Å.

    • PDB coverage: range from 0.00 to 1.00

      Each target in the dataset must have more than one PDB structure that covers more than a fraction N of the whole sequence.

    • E-value:

      A larger value means weaker sequence homology between targets, but fewer targets will be included. Based on BLAST+.
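
To make the ligand parameters above more concrete, the following Python sketch shows one way such filters could be applied. It is not the server's implementation. Several points are assumptions: the input format (tuples of target ID, ligand ID, IC50 in µM), that a ligand is active when its IC50 is at or below the active threshold and inactive when it is at or above the inactive threshold, and that a frequent hitter is removed from both the active and inactive sets. All function names are hypothetical.

```python
from collections import defaultdict


def label_ligands(records, active_max_um, inactive_min_um):
    """Split (target, ligand, ic50_um) records into active/inactive sets.

    Assumption: IC50 <= active_max_um means active and
    IC50 >= inactive_min_um means inactive. If the two thresholds
    overlap, the active label takes precedence in this sketch.
    """
    labeled = defaultdict(lambda: {"active": set(), "inactive": set()})
    for target, ligand, ic50 in records:
        if ic50 <= active_max_um:
            labeled[target]["active"].add(ligand)
        elif ic50 >= inactive_min_um:
            labeled[target]["inactive"].add(ligand)
        # Values between the two thresholds are left unlabeled.
    return dict(labeled)


def remove_frequent_hitters(labeled, max_targets):
    """Drop ligands that are active for more than max_targets targets."""
    counts = defaultdict(int)
    for sets in labeled.values():
        for ligand in sets["active"]:
            counts[ligand] += 1
    hitters = {lig for lig, n in counts.items() if n > max_targets}
    # Assumption: frequent hitters are removed from both label sets.
    return {
        target: {
            "active": sets["active"] - hitters,
            "inactive": sets["inactive"] - hitters,
        }
        for target, sets in labeled.items()
    }


def keep_targets(labeled, min_active, min_inactive):
    """Keep targets with more than N active and more than N inactive ligands."""
    return {
        target: sets
        for target, sets in labeled.items()
        if len(sets["active"]) > min_active
        and len(sets["inactive"]) > min_inactive
    }
```

A run would chain the three steps: label the records, remove frequent hitters, then drop targets that no longer have enough ligands. The strict comparisons in the last two steps follow the "more than N" wording of the parameter descriptions.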

Privacy policy

  1. WHAT INFORMATION DO WE COLLECT?

    We do not collect any user information.

  2. WILL YOUR DATA BE SHARED WITH ANYONE?

    We will return the URL of the prediction results via email. Only people who know this URL can see the prediction results. The URL is designed to be unguessable and is only available for 5 days.

  3. HOW LONG DO WE KEEP YOUR DATA?

    The input data and the prediction result data will be automatically deleted from the server 5 days after the prediction is completed.

  4. HOW CAN YOU CONTACT US ABOUT THIS NOTICE?

    If you have questions or comments about this notice, you may email us at cue@cb.cs.titech.ac.jp