Reader Comments

Flat PDF corrupts semantic data

Posted by Tobi on 05 May 2009 at 05:38 GMT

It seems silly to be the first one to comment on my own article, but as a proof of concept: the “flat PDF” print version of this article contains corrupt InChIKeys, so the “flat PDF” corrupts the semantic content. This is exactly the problem we described in our article. For example, in the first print PDF the Omeprazol InChIKey is written with a hyphen of death (the last one, before “AZ”), introduced by a line break in the PDF: SUBDBMMJDZJVOS-UHFFFAOY-AZ
The original Omeprazol key is SUBDBMMJDZJVOS-UHFFFAOYAZ. In principle one can cut after the first hyphen to obtain the core structure, but one should never split a block (it's always 14 letters, a hyphen, then 10 letters). Searching Google with the corrupt key gives zero hits, while the correct key gives 7 hits.
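
The “14 letters, hyphen, 10 letters” rule above is easy to check mechanically. Here is a minimal Python sketch, assuming only the key layout described in this comment (later InChIKey versions use a different layout, so the pattern would need adjusting). The names KEY_PATTERN, is_well_formed and repair_line_break_hyphen are hypothetical, and the repair step is only a heuristic that drops every hyphen after the first one:

```python
import re
from typing import Optional

# Assumption: the layout stated in the comment above, i.e.
# 14 uppercase letters, one hyphen, 10 uppercase letters.
KEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}")


def is_well_formed(key: str) -> bool:
    """Return True if the whole string matches the 14-10 layout."""
    return KEY_PATTERN.fullmatch(key) is not None


def repair_line_break_hyphen(key: str) -> Optional[str]:
    """Heuristic: keep the first hyphen, drop any later ones.

    Returns the repaired key if it is well formed, otherwise None.
    """
    head, sep, tail = key.partition("-")
    candidate = head + sep + tail.replace("-", "")
    return candidate if is_well_formed(candidate) else None


corrupt = "SUBDBMMJDZJVOS-UHFFFAOY-AZ"   # key as printed in the flat PDF
print(is_well_formed(corrupt))            # False
print(repair_line_break_hyphen(corrupt))  # SUBDBMMJDZJVOS-UHFFFAOYAZ
```

A check like this only catches layout damage such as the extra hyphen; it cannot tell whether the letters themselves are still correct.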

The good thing is that the article is also available as XML and HTML, and it is open access (OA). We submitted all chemical structures to the supplement as SDF, InChIKey and MOL, and also to our website, and we can annotate the article in other databases as well. And yes, a PDF can contain annotation data as XML (an “enriched PDF”), but not many professional publishers use XML PDF to attach chemical structure data and spectra. I guess the wizards from Adobe or Arbortext's Advanced Print Publisher (3B2) would know that...

See also some related older blog comments:
How can we publish semantic chemical documents?
http://wwmm.ch.cam.ac.uk/...

Approaches to compound documents - ORE, PDF, DOCX
http://wwmm.ch.cam.ac.uk/...

Tobias (lead author)

No competing interests declared.