Category:Chembox validation
Revision as of 09:16, 21 August 2009 by Physchim62
This category is used as a repository for files associated with curation & validation of Wikipedia chembox content.
These are some of the main files:
- The main SDF file that Tony has, which has had most (all?) of its content validated. Let's call this WPChemMain2008. This is the one we've been working through to validate ChemBox content.
- The inorganics file that PC has put together. Let's call this WPChemInorganics2008
- The CAS file of 7800 entries (Jim's file from October, which we currently have only in XML):
- Media:CAS-CommonChemistry2008.xml.zip
- NB: doesn't validate against CommonChemistry.dtd!
- Media:CAS-CommonChemistry2008.dtd.zip
- please delete this file (not worth zipping a small .dtd, easier to handle as text)
- CommonChemistry.dtd
- Media:CAS-CommonChemistry2008.xml.zip
- The intersection file of all three put together, using structures as the check:
- The Excel file that I've been working on. It is essentially a manual version of the intersection file (#4 above), but I've found it very useful for catching things that the scripts miss:
- Media:CAS-Wikipedia-Intersection-Dec2008.xls.zip -- already out-of-date
- Scary one-liner to convert it to a generic XML database format:
perl -n -e 'BEGIN {$/="\r";$_=<>; chomp; @x=split /\t/; print "<ItemList>\n"; $row=0}; END {print "</ItemList>\n"};chomp; s/"//g; s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; @y=split /\t/; print " <Item row=\"",++$row,"\">\n",(map {" <$_>".(shift @y)."</$_>\n"} @x), " </Item>\n"' CAS_Wikipedia_Intersection_Dec2008.txt > CAS_Wikipedia_Intersection_Dec2008.xml
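- Note: The first line of the .xls file defines the field names. These become XML element names, so they must be alphanumeric only (no whitespace or punctuation). Save the sheet as a tab-delimited text file before conversion; the one-liner sets $/="\r", so it expects carriage-return line endings.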
- File:CAS-Wikipedia-Intersection-Dec2008-again.xml
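For anyone who finds the one-liner above hard to read, here is a minimal Python sketch of the same conversion. It is not the script that produced the uploaded XML. The function name `tsv_to_xml` is hypothetical, and the sketch makes a few assumptions of its own: it accepts any line ending, skips blank rows, escapes `&`, `<` and `>` in values, and rejects field names that could not serve as XML element names.

```python
import re
from xml.sax.saxutils import escape


def tsv_to_xml(text):
    """Convert tab-delimited text (first row = field names) to an <ItemList> XML string.

    Hypothetical helper mirroring the Perl one-liner; not the original script.
    """
    # splitlines() handles \r, \n and \r\n, so the exporting platform does not matter.
    rows = [line for line in text.splitlines() if line.strip()]
    if not rows:
        return "<ItemList>\n</ItemList>\n"

    # Field names come from the first row, with double quotes stripped.
    fields = [name.replace('"', "") for name in rows[0].split("\t")]
    for name in fields:
        # Assumption: alphanumeric only, starting with a letter (XML names cannot start with a digit).
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", name):
            raise ValueError(f"field name not usable as an XML element: {name!r}")

    out = ["<ItemList>"]
    for row_number, line in enumerate(rows[1:], start=1):
        values = line.replace('"', "").split("\t")
        # Pad short rows so every field gets an (empty) element, as the one-liner does.
        values += [""] * (len(fields) - len(values))
        out.append(f' <Item row="{row_number}">')
        for name, value in zip(fields, values):
            out.append(f"  <{name}>{escape(value)}</{name}>")
        out.append(" </Item>")
    out.append("</ItemList>")
    return "\n".join(out) + "\n"
```

To use it on the exported file, read the text with `open(path, newline="")` so the original line endings reach `splitlines()` unchanged, then write the returned string to the `.xml` file.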
Notes/requests:
- Compressing large files? Martin is better able to handle .zip than .gz.
- File names with spaces in them are harder for dmacks to handle, and Wikimedia is inconsistent about space vs. underscore. Consider hyphens or CamelCase to separate words.
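As an illustration of the naming request above, a small Python sketch could normalise a file name before upload. The helper name `wiki_safe_name` and its `camel_case` flag are hypothetical, and treating underscores the same as spaces is an assumption based on the space-vs-underscore remark.

```python
import re


def wiki_safe_name(filename, camel_case=False):
    """Join the words of a file name with hyphens, or in CamelCase if requested.

    Hypothetical helper: spaces and underscores are both treated as word separators.
    """
    words = [w for w in re.split(r"[ _]+", filename.strip()) if w]
    if camel_case:
        # Capitalise only the first character of each word; leave the rest untouched.
        return "".join(w[:1].upper() + w[1:] for w in words)
    return "-".join(words)
```

For example, `wiki_safe_name("CAVer 0.7.mdb")` gives `CAVer-0.7.mdb`.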
Pages in category "Chembox validation"
The following 2 pages are in this category, out of 2 total.
Media in category "Chembox validation"
The following 14 files are in this category, out of 14 total.
- CAS-CommonChemistry2008.xml.zip ; 1.75 MB
- CAS-WikipediaSDF-Intersection.zip ; 2.09 MB
- CAS-WikipediaSDF-Union.txt ; 1.97 MB
- CAS-WikipediaSDF-Union.zip ; 2.09 MB
- CAVer 0.7.mdb ; 7.33 MB
- Wikichem.sdf.gz ; 1.13 MB
- Wikichem.zip ; 1.05 MB