Difference between revisions of "Category:Chembox validation"

From WikiChem
(Created page with 'Category:Project pages')
 
Line 1: Line 1:
 +
This category is used as a repository for files associated with [http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Chemicals/Chembox_validation curation & validation of Wikipedia chembox content].
 +
 +
These are some of the main files:
 +
 +
# The main SDF file that Tony has, which has had most (all?) of its content validated.  Let's call this WPChemMain2008.  This is the one we've been working through to validate ChemBox content.
 +
# The inorganics file that PC has put together.  Let's call this WPChemInorganics2008.
 +
# The CAS file of 7800 entries (Jim's file from October, which we currently have only in XML):
 +
#*[[:Media:CAS-CommonChemistry2008.xml.zip]]
 +
#**NB: doesn't validate against [[CommonChemistry.dtd]]!
 +
#*[[:Media:CAS-CommonChemistry2008.dtd.zip]]
 +
#**Please delete this file (a small .dtd is not worth zipping and is easier to handle as text).
 +
#*[[CommonChemistry.dtd]]
 +
# The intersection file of all three put together, using structures as the check:
 +
#*[[Media:CAS-WikipediaSDF-Union.zip]].
 +
# The Excel file that I've been working on (essentially a manual version of #4, but very useful for catching things that the scripts miss):
 +
#*[[:Media:CAS-Wikipedia-Intersection-Dec2008.xls.zip]] (already out of date)
 +
<!--#*[[Text to CAS xml]] might convert it to CAS's XML format
 +
#**[[Image:CAS-Wikipedia-Intersection-Dec2008.xml]]-->
 +
#*Scary one-liner to convert it to a generic XML database format:<pre>perl -n -e 'BEGIN {$/="\r";$_=<>; chomp; @x=split /\t/; print "<ItemList>\n"; $row=0}; END {print "</ItemList>\n"};chomp; s/"//g; @y=split /\t/; print " <Item row=\"",++$row,"\">\n",(map {"  <$_>".(shift @y)."</$_>\n"} @x), " </Item>\n"' CAS_Wikipedia_Intersection_Dec2008.txt > CAS_Wikipedia_Intersection_Dec2008.xml</pre>
 +
#**Note: The first line of the .xls file defines the field names, which must be alphanumeric only (no whitespace or punctuation). Save the spreadsheet as a tab-delimited text file before converting it.
 +
#**[[Image:CAS-Wikipedia-Intersection-Dec2008-again.xml]]
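
For anyone who finds the Perl one-liner hard to read, here is a rough Python sketch of the same tab-delimited-text-to-XML conversion. It is not a drop-in replacement: the function name `tsv_to_xml` is made up for illustration, and the behaviour differs in a few places. It escapes `&`, `<` and `>` in values, it strips only quotes that enclose a whole field rather than every quote character, and it rejects field names that would not be legal XML element names (alphanumeric, and assumed here to start with a letter).

```python
import csv
import re
from xml.sax.saxutils import escape

# Assumption: field names are alphanumeric and start with a letter,
# so that they are usable as XML element names.
FIELD_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")


def tsv_to_xml(text):
    """Convert tab-delimited text (header row first) to an <ItemList> XML string."""
    # splitlines() accepts \r, \n and \r\n, so CR-only exports are handled too.
    rows = [row for row in csv.reader(text.splitlines(), delimiter="\t") if row]
    if not rows:
        raise ValueError("input is empty")
    header, records = rows[0], rows[1:]
    for name in header:
        if not FIELD_NAME.match(name):
            raise ValueError(f"field name is not a valid element name: {name!r}")
    out = ["<ItemList>"]
    for number, record in enumerate(records, start=1):
        out.append(f' <Item row="{number}">')
        # Pad short rows so that every field still gets an element.
        record = record + [""] * (len(header) - len(record))
        for name, value in zip(header, record):
            out.append(f"  <{name}>{escape(value)}</{name}>")
        out.append(" </Item>")
    out.append("</ItemList>")
    return "\n".join(out) + "\n"
```

Reading the saved text file into a string and passing it to `tsv_to_xml` returns the XML as a string, which can then be written to a file.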
 +
 +
Notes/requests:
 +
*Compressing large files? Martin is better able to handle .zip than .gz.
 +
*Names with spaces in them are harder for dmacks to handle, and Wikimedia is a bit inconsistent about space vs. underscore. Consider hyphens or CamelCase to separate words.
 +
 
[[Category:Project pages]]
 
[[Category:Project pages]]

Revision as of 10:40, 17 August 2009

This category is used as a repository for files associated with curation & validation of Wikipedia chembox content.

These are some of the main files:

  1. The main SDF file that Tony has, which has had most (all?) of its content validated. Let's call this WPChemMain2008. This is the one we've been working through to validate ChemBox content.
  2. The inorganics file that PC has put together. Let's call this WPChemInorganics2008.
  3. The CAS file of 7800 entries (Jim's file from October, which we currently have only in XML):
  4. The intersection file of all three put together, using structures as the check:
  5. The Excel file that I've been working on (essentially a manual version of #4, but very useful for catching things that the scripts miss):
    • Media:CAS-Wikipedia-Intersection-Dec2008.xls.zip (already out of date)
    • Scary one-liner to convert it to a generic XML database format:
      perl -n -e 'BEGIN {$/="\r";$_=<>; chomp; @x=split /\t/; print "<ItemList>\n"; $row=0}; END {print "</ItemList>\n"};chomp; s/"//g; @y=split /\t/; print " <Item row=\"",++$row,"\">\n",(map {"  <$_>".(shift @y)."</$_>\n"} @x), " </Item>\n"' CAS_Wikipedia_Intersection_Dec2008.txt > CAS_Wikipedia_Intersection_Dec2008.xml
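
The intersection in item 4 ("using structures as the check") can be pictured roughly as below. This is only a sketch under assumptions, not the script that produced the file: it assumes each source has already been reduced to a dictionary keyed by some canonical structure string (an InChI is used here purely as an example), and the function name and the toy records are invented for illustration.

```python
def intersect_by_structure(*sources):
    """Return the structure keys present in every source, with their records.

    Each source is a dict {structure_key: record}. The key is assumed to be a
    canonical structure string; deriving it from the SDF or XML files is
    outside this sketch.
    """
    if not sources:
        return {}
    common = set(sources[0])
    for source in sources[1:]:
        common &= set(source)
    # One list of records per shared key, in the order the sources were given.
    return {key: [source[key] for source in sources] for key in sorted(common)}


# Toy data for illustration only, not taken from the real files.
main = {"InChI=1S/H2O/h1H2": "Water", "InChI=1S/CH4/h1H4": "Methane"}
inorganics = {"InChI=1S/H2O/h1H2": "Water"}
cas = {"InChI=1S/H2O/h1H2": "7732-18-5", "InChI=1S/CH4/h1H4": "74-82-8"}
shared = intersect_by_structure(main, inorganics, cas)
```

Records that appear in only one or two of the sources fall out of `shared`, which is the kind of miss the manual spreadsheet in item 5 is meant to catch.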

Notes/requests:

  • Compressing large files? Martin is better able to handle .zip than .gz.
  • Names with spaces in them are harder for dmacks to handle, and Wikimedia is a bit inconsistent about space vs. underscore. Consider hyphens or CamelCase to separate words.
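
If it helps with the naming request, a file name can be normalised before upload with something like the following. This is a small sketch, and the helper name `hyphenate_filename` is made up. It simply turns every run of spaces or underscores into a single hyphen.

```python
import re


def hyphenate_filename(name):
    """Replace each run of spaces and underscores with a single hyphen."""
    return re.sub(r"[ _]+", "-", name.strip())
```

For instance, a name typed as `CAS Wikipedia_Intersection Dec2008.xls` would come out as `CAS-Wikipedia-Intersection-Dec2008.xls`.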

Pages in category "Chembox validation"

The following 2 pages are in this category, out of 2 total.