Category:Chembox validation

From WikiChem

This category is a repository for files associated with the curation and validation of Wikipedia chembox content.

These are some of the main files:

  1. The main SDF file that Tony has, which has had most (all?) of its content validated. Let's call this WPChemMain2008. This is the one we've been working through to validate chembox content.
  2. The inorganics file that PC has put together. Let's call this WPChemInorganics2008.
  3. The CAS file of 7800 (Jim's file from October, that we currently only have in XML):
  4. The intersection file of all three put together, using structures as the check:
  5. The Excel file that I've been working on, essentially a manual version of #4, which I've found very useful for catching things that the scripts miss:
    • Media:CAS-Wikipedia-Intersection-Dec2008.xls.zip -- already out of date
    • Scary one-liner to convert a tab-delimited text export of it (CR line endings) to a generic XML database format:
      perl -n -e 'BEGIN {$/="\r";$_=<>; chomp; @x=split /\t/; print "<ItemList>\n"; $row=0}; END {print "</ItemList>\n"};chomp; s/"//g; @y=split /\t/; print " <Item row=\"",++$row,"\">\n",(map {"  <$_>".(shift @y)."</$_>\n"} @x), " </Item>\n"' CAS_Wikipedia_Intersection_Dec2008.txt > CAS_Wikipedia_Intersection_Dec2008.xml
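
For anyone who finds the one-liner above hard to read, here is a rough Python sketch of the same conversion. It is not a drop-in replacement: the function name `tsv_to_xml` is made up for illustration, and unlike the Perl it accepts CR, LF, or CRLF line endings, skips blank lines, and escapes `&`, `<`, and `>` in cell values. It assumes the column headers in the first row are already valid XML element names (no spaces or leading digits).

```python
from xml.sax.saxutils import escape


def tsv_to_xml(text):
    """Convert tab-delimited text (first row = column names) to <ItemList> XML.

    Hypothetical helper; mirrors the structure the Perl one-liner emits.
    """
    # Normalise CRLF and bare CR line endings to LF, then drop blank lines.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [ln for ln in normalised.split("\n") if ln]
    if not lines:
        return "<ItemList>\n</ItemList>\n"

    columns = lines[0].split("\t")  # assumed to be valid XML element names
    out = ["<ItemList>"]
    for row, line in enumerate(lines[1:], start=1):
        # Like the one-liner, strip the double quotes Excel puts around some cells.
        values = line.replace('"', "").split("\t")
        # Pad short rows so every column still gets an (empty) element.
        values += [""] * (len(columns) - len(values))
        out.append(f' <Item row="{row}">')
        for name, value in zip(columns, values):
            out.append(f"  <{name}>{escape(value)}</{name}>")
        out.append(" </Item>")
    out.append("</ItemList>")
    return "\n".join(out) + "\n"
```

To use it, read the exported .txt file into a string, pass it to `tsv_to_xml`, and write the result to the .xml file. The text encoding of the export is not recorded here, so it would have to be checked first.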

Notes/requests:

  • When compressing large files, prefer .zip: Martin can handle .zip more easily than .gz.
  • File names with spaces are harder for dmacks to handle, and Wikimedia is inconsistent about spaces vs. underscores. Consider hyphens or CamelCase to separate words.
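
One possible way to follow both requests when preparing an upload is sketched below. The helper names `wiki_safe_name` and `zip_for_upload` are hypothetical, and replacing underscores as well as spaces with hyphens is just one reading of the suggestion above, not an agreed convention.

```python
import re
import zipfile
from pathlib import Path


def wiki_safe_name(name):
    """Replace each run of spaces or underscores with a single hyphen."""
    return re.sub(r"[ _]+", "-", name.strip())


def zip_for_upload(path):
    """Write a .zip next to `path`, using hyphenated names, and return its path."""
    src = Path(path)
    safe = wiki_safe_name(src.name)
    archive = src.with_name(safe + ".zip")
    # ZIP_DEFLATED compresses the file; the default (ZIP_STORED) would only wrap it.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(src, arcname=safe)
    return archive
```

For example, `wiki_safe_name("CAS Wikipedia Intersection Dec2008.xls")` gives `CAS-Wikipedia-Intersection-Dec2008.xls`, which matches the naming of the Media: file listed above.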
