MATLAB - How do I extract the molecules from an SDF file with an ID in another text file, into a new SDF file?


I have an SDF file with thousands of molecules and several text files of IDs, grouped by certain characteristics. Right now, I have a script that loads a CSV database with the features of the molecules and generates the ID text files by classifying the molecules on these features. I want to use these text files to parse the SDF file and produce new SDF files containing the corresponding molecules. I want to do this in MATLAB.

For example, here are some molecules in the original SDF file:

NCGC00178831-03
  Marvin  07111412562D          

 34 37  0  0  0  0            999 V2000
    4.8814   -2.7443    0.0000 Cl  0  5  0  0  0  0  0  0  0  0  0  0
    2.8647   -2.4751    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8647   -1.6501    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
    3.5808   -1.2318    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2970   -1.6501    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0017   -1.2318    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7179   -1.6501    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.0017   -0.4068    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2970    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5808   -0.4068    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8647    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1485   -0.4068    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1485   -1.2318    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4324   -1.6501    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7162   -1.2318    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -1.6501    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.7162   -0.4068    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4324    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8761   -3.5407    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    3.5923   -3.9590    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.3084   -3.5407    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0132   -3.9590    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7293   -3.5407    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.0132   -4.7840    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.3084   -5.1908    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5923   -4.7840    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8761   -5.1908    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1599   -4.7840    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1599   -3.9590    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4438   -3.5407    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7276   -3.9590    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0115   -3.5407    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.7276   -4.7840    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4438   -5.1908    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  3 13  1  0  0  0  0
  4  5  1  0  0  0  0
  4 10  1  0  0  0  0
  5  6  2  0  0  0  0
  6  7  1  0  0  0  0
  6  8  1  0  0  0  0
  8  9  2  0  0  0  0
  9 10  1  0  0  0  0
 10 11  2  0  0  0  0
 11 12  1  0  0  0  0
 12 13  2  0  0  0  0
 12 18  1  0  0  0  0
 13 14  1  0  0  0  0
 14 15  2  0  0  0  0
 15 16  1  0  0  0  0
 15 17  1  0  0  0  0
 17 18  2  0  0  0  0
 19 20  2  0  0  0  0
 19 29  1  0  0  0  0
 20 21  1  0  0  0  0
 20 26  1  0  0  0  0
 21 22  2  0  0  0  0
 22 23  1  0  0  0  0
 22 24  1  0  0  0  0
 24 25  2  0  0  0  0
 25 26  1  0  0  0  0
 26 27  2  0  0  0  0
 27 28  1  0  0  0  0
 28 29  2  0  0  0  0
 28 34  1  0  0  0  0
 29 30  1  0  0  0  0
 30 31  2  0  0  0  0
 31 32  1  0  0  0  0
 31 33  1  0  0  0  0
 33 34  2  0  0  0  0
M  CHG  2   1  -1   3   1
M  END
>  <Formula>
C27H25ClN6

>  <FW>
468.9806 (35.4535+224.2805+209.2465)

>  <DSSTox_CID>
25848

>  <SR-HSE>
0

$$$$
NCGC00166114-03
  Marvin  07111412562D          

 31 32  0  0  0  0            999 V2000
    4.9884   -1.2417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.9884   -2.0696    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2748   -2.4764    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2748   -3.7038    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.9884   -4.1178    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7021   -3.7038    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.4157   -4.1178    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    5.7021   -2.8760    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.9884   -4.9385    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2748   -5.3524    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5612   -4.9385    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5612   -4.1178    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5612   -2.0696    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5612   -1.2417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2748   -0.8279    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.8403   -0.8279    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1267   -1.2417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1267   -2.0696    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8403   -2.4764    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4202   -2.4764    0.0000 Br  0  0  0  0  0  0  0  0  0  0  0  0
    1.4202   -0.8279    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    2.8403    0.0000    0.0000 Br  0  0  0  0  0  0  0  0  0  0  0  0
    5.7021   -2.4764    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.4229   -2.0696    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.4229   -1.2417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7021   -0.8279    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7021    0.0000    0.0000 Br  0  0  0  0  0  0  0  0  0  0  0  0
    7.1366   -0.8279    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    7.1366   -2.4764    0.0000 Br  0  0  0  0  0  0  0  0  0  0  0  0
    7.0866   -4.1963    0.0000 Na  0  3  0  0  0  0  0  0  0  0  0  0
    0.0000   -0.7708    0.0000 Na  0  3  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1 15  1  0  0  0  0
  1 26  2  0  0  0  0
  2  3  2  0  0  0  0
  2 23  1  0  0  0  0
  3  4  1  0  0  0  0
  3 13  1  0  0  0  0
  4  5  2  0  0  0  0
  4 12  1  0  0  0  0
  5  6  1  0  0  0  0
  5  9  1  0  0  0  0
  6  7  1  0  0  0  0
  6  8  2  0  0  0  0
  9 10  2  0  0  0  0
 10 11  1  0  0  0  0
 11 12  2  0  0  0  0
 13 14  2  0  0  0  0
 13 19  1  0  0  0  0
 14 15  1  0  0  0  0
 14 16  1  0  0  0  0
 16 17  2  0  0  0  0
 16 22  1  0  0  0  0
 17 18  1  0  0  0  0
 17 21  1  0  0  0  0
 18 19  2  0  0  0  0
 18 20  1  0  0  0  0
 23 24  2  0  0  0  0
 24 25  1  0  0  0  0
 24 29  1  0  0  0  0
 25 26  1  0  0  0  0
 25 28  2  0  0  0  0
 26 27  1  0  0  0  0
M  CHG  4   7  -1  21  -1  30   1  31   1
M  END
>  <Formula>
C20H6Br4Na2O5

>  <FW>
691.8542 (645.8757+22.9892+22.9892)

>  <DSSTox_CID>
5234

>  <SR-HSE>
0

$$$$
NCGC00263563-01
  Marvin  07111412562D          

 71 76  0  0  1  0            999 V2000
    2.1953   -4.9878    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    3.6803   -4.9878    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    2.9701   -5.4074    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.5858   -4.9878    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    5.1008   -4.9878    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    2.1953   -4.1484    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
   11.8157   -5.6335    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   14.1239   -5.8755    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   11.0893   -5.1008    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    3.6803   -4.1484    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   10.2015   -5.1008    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   12.5905   -5.1653    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   14.9633   -5.8755    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    4.3905   -5.4074    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    5.8755   -5.4074    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.9701   -3.6803    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   11.4606   -4.3905    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
   13.6558   -5.1653    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    9.5559   -5.5043    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    7.2476   -5.5043    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1008   -4.1484    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    1.4850   -5.4074    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.8157   -2.4858    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.9578   -4.9878    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    6.5858   -4.1484    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.5905   -2.9055    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   12.3483   -4.3905    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.8157   -1.6626    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.8755   -3.6803    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
   13.3008   -1.6626    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
   12.5905   -1.2429    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   13.3008   -2.4858    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    8.8457   -4.9878    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   11.4606   -3.1961    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   14.1239   -4.5035    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    0.7748   -4.9878    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   15.4314   -5.2137    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   14.9633   -4.5035    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.9756   -4.2776    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -5.4074    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    7.6673   -4.2776    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1953   -5.7464    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.8764   -4.2776    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.0877   -4.2776    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7748   -4.1484    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   14.5437   -6.4567    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.6803   -3.3736    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.9701   -2.9055    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.8755   -2.9055    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   14.0110   -1.2429    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   12.5905   -0.4197    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.4850   -3.6803    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   15.5444   -6.4082    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.5566   -4.3905    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.3905   -6.1177    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5035   -3.7933    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.1838   -4.2776    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.0110   -2.9055    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   13.6558   -3.7449    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   16.1416   -5.2137    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2130   -2.9701    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1953   -2.3729    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   14.7858   -1.6626    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   13.3008    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.0893   -5.8755    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   12.5905   -5.9885    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    8.8941   -5.7464    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    3.6803   -5.7464    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    5.1008   -5.7464    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   13.6558   -5.9885    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.4681   -6.7634    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
  1  3  1  0  0  0  0
  1  6  1  0  0  0  0
  1 22  1  6  0  0  0
  1 42  1  1  0  0  0
  2  3  1  0  0  0  0
  2 14  1  0  0  0  0
  2 68  1  1  0  0  0
  2 10  1  0  0  0  0
  4 15  1  0  0  0  0
  4 20  1  1  0  0  0
  4 43  1  0  0  0  0
  4 25  1  0  0  0  0
  5 14  1  0  0  0  0
  5 15  1  0  0  0  0
  5 21  1  0  0  0  0
  5 69  1  1  0  0  0
  6 16  1  0  0  0  0
  6 52  1  1  0  0  0
  7  9  1  0  0  0  0
  7 12  1  0  0  0  0
  8 18  1  0  0  0  0
  8 13  1  0  0  0  0
  9 11  1  0  0  0  0
  9 17  1  0  0  0  0
  9 65  1  6  0  0  0
 10 16  1  0  0  0  0
 10 47  1  1  0  0  0
 11 19  1  0  0  0  0
 11 54  1  6  0  0  0
 11 39  1  0  0  0  0
 12 18  1  0  0  0  0
 12 66  1  1  0  0  0
 12 27  1  0  0  0  0
 13 46  1  1  0  0  0
 13 53  1  6  0  0  0
 13 37  1  0  0  0  0
 14 55  1  1  0  0  0
 16 48  1  6  0  0  0
 17 27  1  0  0  0  0
 17 34  1  1  0  0  0
 18 35  1  0  0  0  0
 18 70  1  1  0  0  0
 19 33  1  0  0  0  0
 20 24  1  0  0  0  0
 21 29  1  0  0  0  0
 21 56  1  6  0  0  0
 22 36  1  0  0  0  0
 23 34  1  0  0  0  0
 23 26  1  0  0  0  0
 23 28  1  0  0  0  0
 24 33  1  0  0  0  0
 24 57  1  6  0  0  0
 24 41  1  0  0  0  0
 25 29  1  0  0  0  0
 26 32  1  0  0  0  0
 28 31  1  0  0  0  0
 29 49  1  1  0  0  0
 30 31  1  0  0  0  0
 30 50  1  1  0  0  0
 30 32  1  0  0  0  0
 31 51  1  6  0  0  0
 32 58  1  6  0  0  0
 33 44  1  0  0  0  0
 33 67  1  6  0  0  0
 35 38  1  0  0  0  0
 35 59  1  1  0  0  0
 36 40  1  0  0  0  0
 36 45  2  0  0  0  0
 37 38  1  0  0  0  0
 37 60  1  1  0  0  0
 39 44  1  0  0  0  0
 41 43  1  0  0  0  0
 47 61  1  0  0  0  0
 48 62  1  0  0  0  0
 50 63  1  0  0  0  0
 51 64  1  0  0  0  0
M  CHG  2  40  -1  71   1
M  END
>  <Formula>
C47H83NO17

>  <FW>
934.1584 (916.1205+18.0379)

>  <DSSTox_CID>
28909

>  <SR-HSE>
0

$$$$

And here are some IDs from the text file:

NCGC00015959-03
NCGC00168261-01
NCGC00257010-01
NCGC00254654-01
NCGC00254471-01

The generated SDF file should start like this:

NCGC00015959-03
  Marvin  07111412562D          

 25 30  0  0  0  0            999 V2000
    3.4098   -1.3130    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
    4.8329   -1.3130    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4098   -2.1380    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.1248   -2.5436    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.6948   -2.5436    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.8329   -2.1380    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.1248   -0.8937    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.5547   -0.8937    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9799   -2.1380    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.6948   -3.3548    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2718   -2.5436    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2718   -3.3548    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.1248   -3.3548    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9799   -3.7741    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.5547   -2.5436    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.2765   -1.3130    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7128   -0.0894    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.4881   -2.2755    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.4881   -3.6160    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.8746   -0.7562    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.5378    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -2.9423    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4098   -3.7741    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.2765   -2.1380    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.6948   -0.8937    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  3  1  0  0  0  0
  1  7  2  0  0  0  0
  1 25  1  0  0  0  0
  2  7  1  0  0  0  0
  2  6  2  0  0  0  0
  2  8  1  0  0  0  0
  3  4  2  0  0  0  0
  3  5  1  0  0  0  0
  4 13  1  0  0  0  0
  4  6  1  0  0  0  0
  5  9  1  0  0  0  0
  5 10  2  0  0  0  0
  6 15  1  0  0  0  0
  8 16  2  0  0  0  0
  8 17  1  0  0  0  0
  9 11  2  0  0  0  0
 10 14  1  0  0  0  0
 10 23  1  0  0  0  0
 11 18  1  0  0  0  0
 11 12  1  0  0  0  0
 12 14  2  0  0  0  0
 12 19  1  0  0  0  0
 13 23  2  0  0  0  0
 15 24  2  0  0  0  0
 16 20  1  0  0  0  0
 16 24  1  0  0  0  0
 17 21  1  0  0  0  0
 18 22  1  0  0  0  0
 19 22  1  0  0  0  0
 20 21  1  0  0  0  0
M  CHG  1   1   1
M  END
>  <Formula>
C20H14NO4

>  <FW>
332.3289

>  <DSSTox_CID>
25204

>  <NR-AR>
0

>  <NR-ER-LBD>
1

>  <NR-AhR>
1

$$$$
NCGC00168261-01
  Marvin  07111412562D          

 23 25  0  0  0  0            999 V2000
    2.1236   -2.4895    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4205   -2.0662    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1236   -3.3074    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4205   -3.7235    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.7174   -2.4895    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7174   -3.3074    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8554   -2.0662    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -2.0662    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4205   -1.2412    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8554   -3.7235    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5656   -2.4895    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5656   -3.3074    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8554   -1.2412    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.7174   -0.8251    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -1.2412    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0430   -2.8984    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7174   -4.1324    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2902   -3.7378    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7174    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0292   -3.3145    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.4569   -3.3360    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7538   -3.7378    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.1743   -3.7378    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  2  0  0  0  0
  1  7  1  0  0  0  0
  2  5  2  0  0  0  0
  2  9  1  0  0  0  0
  3  4  1  0  0  0  0
  3 10  1  0  0  0  0
  4  6  1  0  0  0  0
  5  8  1  0  0  0  0
  5  6  1  0  0  0  0
  6 16  1  0  0  0  0
  6 17  1  0  0  0  0
  7 11  2  0  0  0  0
  7 13  1  0  0  0  0
  8 15  2  0  0  0  0
  9 14  2  0  0  0  0
 10 12  2  0  0  0  0
 11 12  1  0  0  0  0
 12 18  1  0  0  0  0
 14 15  1  0  0  0  0
 14 19  1  0  0  0  0
 18 20  1  0  0  0  0
 20 22  1  0  0  0  0
 21 22  1  0  0  0  0
 21 23  1  0  0  0  0
M  END
>  <Formula>
C21H26O2

>  <FW>
310.4299

>  <DSSTox_CID>
28922

>  <NR-AR>
0

>  <NR-AhR>
1

>  <SR-MMP>
1

$$$$
NCGC00257010-01
  Marvin  07111412562D          

 35 37  0  0  0  0            999 V2000
    2.0286   -3.5779    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.0019   -7.8578    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.0019   -0.7019    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8589   -3.5779    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.6092   -2.8589    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.6092   -4.2799    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    3.2784   -4.2799    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    6.5825   -7.1217    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.5825   -1.4381    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3681   -3.5779    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5024   -3.5779    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5024   -4.9989    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0915   -4.2799    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.3412   -3.5779    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.3412   -4.9989    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7704   -4.2799    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7704   -2.8589    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.7294   -1.1385    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    6.2829   -0.2996    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    7.7294   -7.4213    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    7.4384   -8.5597    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    6.2829   -8.2601    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    7.4384    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    7.0019   -2.1485    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.0019   -6.4112    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7607   -1.4381    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7607   -7.1217    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7607   -5.7008    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.7607   -2.8589    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.5825   -5.7008    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.5825   -2.8589    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.3412   -6.4112    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.3412   -2.1485    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -2.9103    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0086   -4.2542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  4  2  0  0  0  0
  1  5  1  0  0  0  0
  1  6  1  0  0  0  0
  2  8  1  0  0  0  0
  2 20  1  0  0  0  0
  2 21  1  0  0  0  0
  2 22  1  0  0  0  0
  3  9  1  0  0  0  0
  3 18  1  0  0  0  0
  3 19  1  0  0  0  0
  3 23  1  0  0  0  0
  4  7  1  0  0  0  0
  5 17  1  0  0  0  0
  6 16  1  0  0  0  0
  7 13  2  0  0  0  0
  8 27  1  0  0  0  0
  8 25  2  0  0  0  0
  9 26  2  0  0  0  0
  9 24  1  0  0  0  0
 10 16  1  0  0  0  0
 10 34  1  0  0  0  0
 10 35  1  0  0  0  0
 10 17  1  0  0  0  0
 11 13  1  0  0  0  0
 11 14  2  0  0  0  0
 12 13  1  0  0  0  0
 12 15  2  0  0  0  0
 14 29  1  0  0  0  0
 15 28  1  0  0  0  0
 24 31  2  0  0  0  0
 25 30  1  0  0  0  0
 26 33  1  0  0  0  0
 27 32  2  0  0  0  0
 28 30  2  0  0  0  0
 28 32  1  0  0  0  0
 29 31  1  0  0  0  0
 29 33  2  0  0  0  0
M  END
>  <Formula>
C25H24F6N4

>  <FW>
494.4753

>  <DSSTox_CID>
3868

>  <NR-AR>
0

>  <NR-ER>
1

>  <NR-AhR>
1

$$$$

I've seen the post "Extract molecules in order from SDF file according to IDs given in another file", which offers a Unix solution. I used that workaround on the command line:

    awk 'BEGIN{ORS="$$$$"}NR==FNR{a[$1]=$0;next}$1 in a' ids.txt RS="$" molecules.sdf > molecules_by_ids.sdf

and got most of what I wanted. However, even with this command I cannot extract 100% of the molecules from the SDF file. For example, 981 molecules are positive for one of the features and the text file contains 981 IDs, but the command writes only 950 molecules to the SDF file.

What I really want is a MATLAB solution that does not miss any of the molecules in the generated file. I appreciate any efforts to make a solution. Thanks!


There is 1 solution below

Math Simp On

A workaround I found in MATLAB is the following function, where "id" is the name of the ID TXT file, "sdfs" is the SDF database, and "sdf_name" is the name of the new SDF file with the molecules extracted by ID:

function write_sdf(id, sdfs, sdf_name)
% Open the text file of ids.
fid = fopen(id);

% Convert the sdf file to a character array.
data = fileread(sdfs);

% For each id, get the portion of the sdf file corresponding
% to the molecule id.
while true
    mol_id = fgetl(fid);
    mol_full = '';

    % fgetl returns -1 (a number, not text) at the end of the file,
    % so leave the loop when the result is not a character vector.
    if ~ischar(mol_id)
        % We're done with the id file.
        fclose(fid);
        break;
    else
        mol_after = extractAfter(data, mol_id);
        mol_between = extractBefore(mol_after, '$$$$');
        mol_full = [char(mol_id) char(mol_between) '$$$$'];

        % Write the molecule to the sdf file.
        writelines(mol_full, sdf_name, WriteMode='append');
    end
end

end

The problem with this solution is that it is VERY slow. If someone knows a faster way to do this, please let me know! For now, I will be using this.
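One possible way to speed this up, offered as a sketch rather than a tested drop-in replacement: the function above likely spends most of its time re-scanning and copying the SDF text for every ID, and reopening the output file for every molecule. The version below reads the SDF file once, splits it into records at lines that contain only $$$$, indexes each record by its first line, and writes the output in a single pass. The name extract_sdf_by_ids and its arguments are hypothetical, it assumes MATLAB R2017a or later (string arrays and double-quoted literals), and it assumes that the ID is always the first line of a record, as in the samples above.

```matlab
function missing = extract_sdf_by_ids(id_file, sdf_file, out_file)
% Hypothetical helper (not from the original post): copy every record of
% sdf_file whose first line matches an ID in id_file into out_file.
% Returns the IDs that were not found, as a string array.

% Read the IDs, trimming whitespace and dropping blank lines.
ids = strtrim(splitlines(string(fileread(id_file))));
ids = ids(ids ~= "");

% Read the SDF file once and split it into lines.
sdf_lines = splitlines(string(fileread(sdf_file)));
is_end = strtrim(sdf_lines) == "$$$$";

% Index every record by its first line (assumed to be the molecule ID).
% Text after the last $$$$ line is ignored.
lookup = containers.Map('KeyType', 'char', 'ValueType', 'any');
start = 1;
for stop = find(is_end).'
    block = sdf_lines(start:stop);
    start = stop + 1;
    key = char(strtrim(block(1)));
    if ~isempty(key) && ~isKey(lookup, key)
        lookup(key) = block;   % keep the first record for a repeated ID
    end
end

% Collect the matching records in the order of the ID file.
found = false(size(ids));
parts = cell(numel(ids), 1);
for k = 1:numel(ids)
    key = char(ids(k));
    if isKey(lookup, key)
        parts{k} = lookup(key);
        found(k) = true;
    end
end
out = vertcat(strings(0, 1), parts{found});

% Write everything in one call instead of appending per molecule.
fid = fopen(out_file, 'w');
if ~isempty(out)
    fprintf(fid, '%s\n', join(out, newline));
end
fclose(fid);

missing = ids(~found);
end
```

Because the lookup compares the whole first line of each record rather than searching for the ID anywhere in the text, an ID that also appears inside another record should not cause a wrong match. The returned list may also help track down a mismatch such as 981 IDs versus 950 molecules, for example: missing = extract_sdf_by_ids('ids.txt', 'molecules.sdf', 'molecules_by_ids.sdf'); disp(missing)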