Online Chemistry Nexus Proposal

From WikiChem
Revision as of 07:14, 13 August 2009 by Physchim62 (talk | contribs) (How to meet these objectives: paring searching section)

This is a rough draft of a proposal being prepared by User:Walkerma and others, to apply for funding through the NSF STCI program in August 2009.

If you are committed to being part of this grant proposal, please create an account, and sign below with a star then four tildes (* ~~~~). Thanks!

  • Martin A. Walker 19:13, 29 July 2009 (UTC)
  • Physchim62 21:33, 29 July 2009 (UTC)
  • Proteins 16:03, 12 August 2009 (UTC)
  • Beetstra 08:17, 13 August 2009 (UTC)

An Open Online Nexus for Chemistry

Summary

The World Wide Web has transformed the way chemists work, and yet we are still a long way from realizing the full potential of this technology for an open network of chemical information. What is needed is some organization of resources, so as to create some major hubs for the new information landscape.

It is impossible to overstate the importance of the organization of information within chemistry, with more than a million new scientific articles published every year. One only needs to recall that the periodic table was born of a young professor's need to organize the material in his textbook of inorganic chemistry. Yet many of the existing information sources, such as SciFinder or Web of Science, are only really accessible to professionals working in large institutions – because of their cost – and certainly have little to offer the freshman student or the interested member of the general public.

Chemists have long recognized the intellectual merit in collecting and collating published information and repackaging it for a wider audience. The most influential journal in chemistry is not Journal of the American Chemical Society or Angewandte Chemie; it is Chemical Reviews. Indeed, four out of the top five chemical journals (by impact factor) are review journals, which publish no original research results whatsoever.

This project aims to create a stand-alone website that is both a repository for chemical information in itself and also a portal to link to information held elsewhere. The site will function as a "wiki", a site which the users themselves can add to and modify: in this way, the site should remain relevant to the requirements and desires of its users.

The basic idea of a wiki-based chemistry site is not unusual; many such sites already exist, albeit on a small scale. With social networking being fashionable, we can expect to see many poorly thought-out wikis being created, then lying dormant. With websites, it is often the execution of a good idea that matters. Although we cannot predict the twists and turns of future technology, we can attempt to learn from the past - to avoid the mistakes,[1] and to learn what works well. The difference between success and failure may lie in a few lines of code. The project brings together a group with many years of experience in presenting chemistry on the web, through Wikipedia, ChemSpider and beyond, with the capability to develop a truly valuable and powerful chemistry wiki.

The site aims to serve the chemistry community in the widest sense of that term: academic and professional chemists, educators and learners of all levels, interested members of other professions and of the general public. It aims to overcome the increasing specialization of communication channels within chemistry, and to facilitate communication between chemists and with the wider community of which we all form part.

Introduction

Chemists will typically search for a variety of chemical information during a normal workday. The type of information depends greatly on the specialty of the chemist, but some general areas of information include:

  • Chemical literature – searches for relevant papers, reviews of specific topics, current awareness, patent searching, "grey information," as well as the actual primary source material.
  • Common properties of chemical compounds – structure, molecular weight, synonyms, melting point, solubility.
  • Chemical reaction information – different synthetic pathways, reagents & catalysts, reaction conditions.
  • Personal networking – job searches, other chemists working in your subject area or locality, conferences, grant advice.
  • Resources – grants available, graduate programs, sabbatical opportunities, government support, legal & business advice.
  • News and general chemical knowledge – chemical industry developments, "hot" subject areas, broad changes in law or government.

To develop a successful network hub, we must consider what works well for chemists at present. We should not create a site and then try to persuade chemists to come; rather, we should examine the current needs (and frustrations!) of chemists, then aim to meet those needs.

  • Professional societies – organizations such as ACS and RSC already provide a superb array of information and resources to meet the needs of chemists, both personal and professional. These societies traditionally form the core of networks for chemists, though obviously many resources are closed to non-members. Ideally, any new information resource should be developed in collaboration with these organizations.
  • Successful free information "hubs" on the Web – besides Google, chemists frequently search for chemical information on websites such as Wikipedia, ChemSpider and government sites.
  • Many successful information hubs in chemistry require a fee, but they are available to some members of the chemistry community. The most powerful is Chemical Abstracts Service, which provides a remarkable array of information, particularly for searching the chemical literature. Other important resources include the Science Citation Index and Beilstein/Gmelin.
  • Chemists often use information hubs that have a broader scope than just chemistry, for example the Derwent World Patents Index, Lexis-Nexis and the sites of for-profit publishers such as Elsevier (Science Direct) and Wiley.

With such great resources, why do chemists need yet another website? The recent rise of "Web 2.0" sites demonstrates the power of technologies such as wikis, and such technologies could bring great benefits to science and medicine.[2] Unfortunately, many of the existing information networks available to chemists are closed, and many involve a fee. This approach is at odds with modern "Web 2.0" methods; as Hollett has pointed out,[3] the essential Web 2.0 attributes are "trust, openness, voluntariness and self-organization". If chemistry is to capitalize on the full power of the Internet, we need new sites that are open, and very different from websites of traditional information providers.

Younger chemists naturally turn to the Web for information, and they expect to find it there for free. A fellow-scientist recently shared his frustration that his graduate students rarely think to go beyond a Google search, to use the fee-based powerful resources that are freely available to them at the university. Rather than making a (fruitless) effort to "re-educate" every new student, we should adapt the resources to ensure that young researchers find the information they need.

Many existing open sites meet specific information needs for chemists, but there is no single site that brings together all of those needs under one "roof." ChemSpider serves as an important information hub, but it focuses mainly on property information on chemical compounds; information such as educational resources cannot be found there. Other sites have narrower scope: Webreactions supplies chemical reaction listings, while ZINC provides a database of compounds for "virtual screening." The Organic Chemistry Portal describes the literature in that field, while webelements.com provides descriptions of the chemical elements and their basic compounds. All are useful, yet there is no site that links all of these sites together, to allow chemists to find all their answers from one chemistry portal.

Wikipedia is defined as a general encyclopedia, and it specifically excludes original research, specialist technical documents, opinion pieces, experimental procedures, educational materials, etc. The majority of chemical information lies clearly outside the scope of Wikipedia.[4] As such, it can never serve the broader needs of chemists, though it serves as an excellent model.

A further requirement is to develop an online community of chemists. This is a means to an end – producing a large amount of chemistry content – and not in itself the purpose of the proposed site. If we can bring together even a few dozen chemists, and interest them in contributing their knowledge, we can create a paradigm-changing resource. Perhaps the most active online community at present is based at the Chemistry & Chemicals WikiProjects on Wikipedia, and this numbers around 20-50 active contributors. But this group is limited to writing encyclopedic articles. ChemSpider has a smaller group of chemists curating chemical compound/structure information. But there is no online community of chemists generating broad content to meet the wider needs of the community.

d. What works well on the Web

The history of the Internet is littered with websites that have failed. We must learn what works well, and try to avoid the pitfalls. To be successful, we must

  • Speak the language of chemists. It is vital to have any new information nexus organized by chemists for chemists.
  • Know the needs of chemists. The site should provide real and relevant content, not simply trivia. It should not be centered around an exciting program, algorithm or piece of technology - that may be "cool," but is it something chemists will really use a lot?
  • Use technical expertise. A website that is slow or unreliable will never flourish, even if it is tailored for chemists. Experts can also take full advantage of more advanced technology to provide additional features such as Jmol for structure displays. However, the technological wizardry should mainly lie in the background, and it should never be allowed to dominate the site.
  • Have a functional layout. There is the basic requirement that users be able to find what they want with as few clicks/scrolls as possible, without being overloaded with words and boxes. Successful sites such as Amazon.com, Ebay.com and Etsy.com thrive because the site is designed around what the user wants and needs.
  • Have an attractive design. Aesthetics matter! We may consider ourselves scientists who don't worry about "trivia" such as eye-appeal, but in fact this can make the difference between a successful site and a failure.
  • Define a clear purpose. If the site tries to do too many things, or the purpose and scope of the site is unclear, it will fail.

e. What works well with a wiki

With the success of Wikipedia, wikis have become popular in recent years, yet many wikis fail to achieve even a basic level of use. Since we plan to use a wiki for the main infrastructure, it is critical to organize it so that it flourishes. It needs to have:

  • Clear scope. Although this is important for a traditional website, it is absolutely critical for a wiki, since people will only contribute their time if they see a very clear purpose for all their hard work. Wikipedia is an encyclopedia, not a sales brochure, a "how-to" site or a site for opinion pieces. Wikitravel provides travel information; it does not try to compete with Wikipedia.
  • A reason to contribute. Many wikis fail not because the purpose is poor, but because people don't care enough about helping that purpose. Wikipedia flourishes precisely because contributors have a passion for sharing their knowledge with the world, and many will work late into the night to serve that "higher purpose." Many chemists have a similar love of chemistry, but they will only share their valuable time if they can see that the site really captures that passion and adds value to their work.
  • A community of users, and a community of contributors. The user community in our case is clear. But for the wiki to succeed, we also need to identify people who will contribute content to get the site off the ground. Without contributors, a wiki will completely fail.
  • Critical mass. As well as needing contributors, a lot of work must be done to set up and publicize the site. Some of the early content may need to be written from scratch by paid employees, in order to build a body of content that makes the site viable.
  • Scalability. Any wiki that plans to become an information nexus must be able to handle growth and traffic. Beyond the obvious technology needs, there must be a definite infrastructure to organize the contributors and direct the growth effectively. There needs to be a group of paid staff to provide commitment and continuity during times of growth - voluntary contributors come and go.
  • A clear set of rules. Wikipedia would have failed but for certain rules such as "neutral point of view" that have served as community norms, and helped restrict the scope of the site. Any social networking site has to deal with people as well as content, and people often have strong opinions and feelings. A certain amount of disagreement is inevitable and even good, but clear policies can help to reduce unnecessary friction. Rules should define what type of content is inappropriate or outside the scope of the site.
  • A style guide. How should chemical structures be drawn? What chemical names are appropriate? Should the site use American English, British English, or both? How should pictures be formatted?
  • Open access, free content. In order to allow mashups with other sites, the site must be completely open.[3] If a wiki is not completely open to the world, it will never become significant in size, and it will not be "noticed." If a wiki charges for any significant part of its content, then nearly all volunteer contributors will be completely alienated and the site will fail. All content needs to be clearly labeled with an open copyright; we plan to use the very successful Creative Commons license for this purpose, and follow the terms of the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities. However, there will also be a clear set of privacy rules, to ensure that personal information is not shared publicly.

Objectives and significance of the proposed work

We would like to see the website serve two main purposes:

a. A broad repository of user-generated content

The site should allow for chemists to place information and data easily on the web, in a place where others can find it through a simple search. We envisage scientists providing a wide variety of information including experimental methods and results, topical reviews, news stories, physical data; more details are given in section 3.

Most chemistry resources (such as ACS, CAS, or even ChemSpider) currently have a "top-down" model, where the site administrators define what information is needed and leave blank spaces to be filled in. This site would instead follow a radically different approach, based on the Wikipedia model, where the contributors themselves decide what is presented and how; this should foster a culture where most contributors feel that their work is valued. It may mean that the site develops in unexpected ways, but that should be seen as an asset, not a handicap. A flourishing community of chemists of this sort will be necessary if the site is to succeed.

b. A free chemistry portal

We aim to provide chemists with a simple portal through which they can find chemical information on other sites. Core content available under open licenses (for example Wikipedia articles, and perhaps some ChemSpider content) might be provided on the site itself, so that it could be integrated into the site and formatted to meet our needs. Ideally we would use on-site specific "unvandalized versions" that are periodically updated, while all editing would be redirected to the host site, in order to reduce unnecessary work and "forks". Information outside the core would ideally be provided through mashups. By having one portal that allows a standard form for structure searching (and other semantic searches) of hundreds of chemistry sites, users will not need to download dozens of different Java scripts and learn the foibles of each site separately. If our portal rises high in Google rank, as we hope, it will allow users to uncover information from small sites that may be otherwise hard to find via Google, and this will in turn also benefit lesser-known sites.

How to meet these objectives

In order to meet the dual goals of being both an information repository and a portal, the site should allow users to achieve either of these with a minimum of mouse clicks, without overwhelming users with scores of buttons and search boxes. Functionality and mashups should be organized in a straightforward manner, yet they should work seamlessly around the user's needs. This means that the foundations of the site will need to be laid out carefully in advance, rather than simply being allowed to evolve in a random way. The organic nature of the site will mean that (in time) new features may just evolve; if these become a core part of the website, the main layout may need to be changed in order to remain efficient.

We will assume that nearly all users will simply want to search on the site for information. Searching therefore needs to be the most straightforward task, but not to the total exclusion of others, since we want to encourage users to contribute content. Using the MediaWiki software, all pages have a text search box; the main page would also have a large area devoted to a structure drawing interface for inputting chemical structures.

All pages will have a simple URL that is "human-friendly" (e.g., the page on methanol would be called "Methanol") and which can have a permanent link. The MediaWiki software is organized in this way by default, and even older versions of pages can be seen and linked to. Such simplicity and predictability also tends to raise the Google rank of articles, helping the site to grow.
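
As a rough illustration of the URL scheme described above, the following Python sketch builds a human-friendly page address and a permanent link to one revision. The `oldid` query parameter is the one MediaWiki uses to address a specific revision; the base address, the `/wiki/` path prefix, the location of `index.php` and the revision number are assumptions made for the example, since these depend on how a given site is configured.

```python
from urllib.parse import quote

# Hypothetical base address; the "/wiki/" prefix and the location of
# index.php are common MediaWiki setups but are assumptions here.
BASE = "https://wikichem.org"


def normalize_title(title: str) -> str:
    """Apply the usual title convention: underscores for spaces, capital first letter."""
    cleaned = title.strip().replace(" ", "_")
    return cleaned[:1].upper() + cleaned[1:]


def page_url(title: str) -> str:
    """Human-friendly address of the current version of a page."""
    return f"{BASE}/wiki/{quote(normalize_title(title))}"


def permalink(title: str, oldid: int) -> str:
    """Permanent link to one specific revision, selected by its oldid number."""
    return f"{BASE}/index.php?title={quote(normalize_title(title))}&oldid={oldid}"


if __name__ == "__main__":
    print(page_url("methanol"))
    print(permalink("acetic acid", 12345))  # 12345 is a made-up revision number
```

Because the address is derived from the page title alone, a reader (or a script) can guess the location of a page without first consulting a search box.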

Data should be organized in as many ways as possible. In particular, it must be accessible to both human users and automated scripts (machine-readability) in order to produce virtual databases of chemical information.
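
One way to picture machine-readability is a page whose data sits in a template that both a human reader and a script can follow. The Python sketch below extracts the name/value pairs from such a template. The template name `CompoundData` and its field names are hypothetical, and the parser is deliberately minimal: it does not handle nested templates or links that contain a pipe character.

```python
import re


def parse_template(wikitext: str) -> dict:
    """Return the 'name = value' parameters of the first {{...}} template found."""
    match = re.search(r"\{\{(.*?)\}\}", wikitext, flags=re.DOTALL)
    if match is None:
        return {}
    parts = match.group(1).split("|")
    fields = {}
    for part in parts[1:]:  # parts[0] is the template name itself
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical page source; template and field names are illustrative only.
SAMPLE = """{{CompoundData
| Name = Methanol
| Formula = CH4O
| MolarMass = 32.04
}}
Methanol is the simplest alcohol."""

if __name__ == "__main__":
    for key, value in parse_template(SAMPLE).items():
        print(f"{key}: {value}")
```

A script that runs this over many pages could assemble the kind of virtual database mentioned above without the site having to maintain a separate data store.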

Some examples of possible content include:

  • Chemical compound pages – listing physical properties, links to other sites, and if possible, prose content.
  • Chemical reactions, reagents – these might link to relevant literature references
  • Experimental data – physical properties, reaction results, etc. – these pages could incorporate data from open notebook science groups, and it is hoped to incorporate open notebook functionality into the site itself.
  • Literature reviews, summaries
  • Experimental procedures
  • Educational materials – lesson plans, study problems, teaching materials, at every level from grade school to graduate school.
  • News – the latest information on important breakthroughs, industry takeovers, government regulations, as well as site news.
  • Chemistry connections – links to professional societies, blogs, scientific publishers, information on grants, conferences, etc.
  • Blog – commentary on new chemistry or news
  • Wikichem community – interest groups, technical help, rules & guidelines, etc.

Searching facilities are essential to be able to find the relevant content within the site. Text searching is important, but insufficient for chemical usage: the development of effective structure searching that covers all the material on the site is a key objective of this project.
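
To make the contrast with text searching concrete, here is a minimal Python sketch of the simplest form of structure search: exact matching on a canonical structure identifier. It assumes, purely for illustration, that every page dealing with a compound is tagged with that compound's InChI string; the class and page titles are hypothetical. Substructure or similarity searching would need a cheminformatics toolkit and is not attempted here.

```python
from collections import defaultdict


class StructureIndex:
    """Maps a structure identifier (here an InChI string) to page titles."""

    def __init__(self):
        self._pages = defaultdict(set)

    def add(self, inchi: str, page_title: str) -> None:
        """Register a page as containing material about the given structure."""
        self._pages[inchi.strip()].add(page_title)

    def search(self, inchi: str) -> list:
        """Return all page titles tagged with exactly this structure, sorted."""
        return sorted(self._pages.get(inchi.strip(), ()))


METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

index = StructureIndex()
index.add(METHANOL, "Methanol")
index.add(METHANOL, "Methanol synthesis (procedure)")  # hypothetical page title
index.add(ETHANOL, "Ethanol")

if __name__ == "__main__":
    print(index.search(METHANOL))
```

The point of the sketch is that a structure query can return compound pages, procedures and data pages alike, regardless of which names or synonyms their text happens to use.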

Suitability of proposed methods

The method chosen for achieving the objectives of the project is a public website in wiki format powered by MediaWiki software.

A wiki is a website which allows the simple creation of multiple interlinked webpages, usually through the use of software which automatically converts relatively simple text-like coding (stored in a database) into standard HTML (for display on the web). MediaWiki is a popular software package for the creation of wikis, available under a free license (GPLv2+). The reasons for these technical choices are explained below.
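
The conversion from simple markup to HTML can be illustrated with a toy converter. The Python sketch below handles only three pieces of wiki syntax (bold, italic and internal links) and is not MediaWiki's actual parser; the `/wiki/` link prefix is an assumption.

```python
import html
import re


def render(wikitext: str) -> str:
    """Convert a tiny subset of wiki markup to HTML."""
    # Escape <, > and & but leave apostrophes alone: they carry the markup.
    text = html.escape(wikitext, quote=False)
    # Bold must be handled before italic, because ''' contains ''.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)

    def link(match):
        # [[Target|label]] or [[Target]]; the label defaults to the target.
        target = match.group(1).strip()
        label = (match.group(2) or target).strip()
        href = "/wiki/" + target.replace(" ", "_")
        return f'<a href="{href}">{label}</a>'

    return re.sub(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]", link, text)


if __name__ == "__main__":
    print(render("'''Methanol''' is an ''alcohol''; see [[Acetic acid|acetic acid]]."))
```

Contributors type only the left-hand form; the consistent HTML that readers see is generated by the software.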

Choice of wiki format

Wikis are a relatively common choice of tool for the collaborative creation of documents. A wiki may be public or private: the best known example of a public wiki is the online encyclopedia Wikipedia,[5] although the wiki-hosting company Wikia, Inc., hosts more than 10,000 public wikis on a vast range of subjects;[6] private wikis are used by many organizations (e.g., Novell, Inc.,[7] the Royal Society of Chemistry[8]) to prepare documents either for purely internal use or for subsequent publication in a different format. This proposal, for example, was drafted on a private wiki.

The technical requirements to contribute to a wiki document are access to the server (usually over the Internet) and a standard web browser: it is this simplicity of contribution which makes the format so popular for collaborative editing. The use of site preferences for converting the text into HTML ensures a homogeneous appearance to the site. These are significant advantages over the production of multiple web pages by separate authors using more traditional HTML editors.

The use of multiple authors, and indeed the attraction of as many different contributors as possible, is an essential part of this proposal. This should enable a wide coverage of different fields within the chemical sciences, beyond the professional expertise of the PI and the workers funded under any award. More importantly, it ensures that the focus of the site remains on those areas which are most important to its users: the wiki format allows popular areas to expand with ease while still remaining integrated with the rest of the site content.

Choice of MediaWiki software

MediaWiki[9] is the software used by the popular online encyclopedia Wikipedia and other projects of the Wikimedia Foundation, and also on the wikis hosted by Wikia, Inc., and elsewhere. It is probably the most popular wiki software,[10] and is certainly the best known. This in itself would be a strong argument for using MediaWiki in this project, as it is the wiki software with which potential contributors are most likely to be familiar.

MediaWiki is available free of charge under the GNU General Public License 2.0 and later versions (GPLv2+).[11] The documentation is freely available under both the GNU Free Documentation License 1.2 and later versions (GFDL1.2+)[12] and the Creative Commons Attribution-Share Alike License version 3.0 Unported (CC-BY-SA-3.0).[13] As such, it is classed as “free software”. It is supported by a community of several hundred volunteer developers (and at least three salaried staff members), and has proved both robust and simple to use. The latest release is version 1.15.1 (July 13, 2009).[14]

MediaWiki functions as a database program combined with a parser to convert a text (containing simple markup) into HTML for page publication. The database contains tables for page content, internal links, users, page categorization, images and other non-text files, and other types of page: custom page types can also be created. The appearance of the site can be customized using CSS style sheets, and JavaScript functions are also supported. Individual users can also customize their view of the site without affecting the public view through personal CSS style sheets and JavaScript functions.

In addition to the basic MediaWiki “core”, there are more than a thousand published “extensions” to MediaWiki,[15] written (like the original program) in the PHP programming language. These extensions add optional functions to the main program: one example is cite.php, which simplifies the handling of references.[16]

It is intended that the programming aspect of this project will concern the writing of extensions and/or JavaScript functions, rather than the modification of the main MediaWiki program. All programs which are created during the project will be licensed under the GNU General Public License version 3.0[17] and later versions, and the accompanying documentation will be licensed under both the GNU Free Documentation License version 1.3[18] and later versions and under the Creative Commons Attribution-Share Alike License version 3.0 and later versions. This condition will be enforced contractually both for employees and for any contractors. In this way, the software resulting from this project will be freely available for reuse and modification by any other person.

Choice of creating a new website

The investigators have chosen to use the same format and basic software as the popular pre-existing website Wikipedia. However, the project concerns the development of a separate website.

The proposed website will be firmly focussed on the chemical sciences in all their forms, as opposed to the unashamed generalism of Wikipedia. The intention is to create a site “by chemists, for chemists” that will also be useful to all users of chemical information regardless of professional status or specialty. Wikipedia aims to make its articles accessible to a general audience,[19] whereas the proposed website will allow (and encourage) more technical content where the subject matter justifies it.

The proposed website will also allow the use of functionalities which are specific to chemistry. One example is searching by chemical structure: a generalist site such as Wikipedia could never justify including structure searching as an integral part of its offer, as the proportion of chemistry searches among its usage is simply too low, but a site devoted to the chemical sciences can, indeed must, offer such a possibility.

The development of specialist content and functionality is all but impossible within the Wikipedia framework, hence the need for a new site. Chemistry is not unique in having these problems – subject-specific wikis such as PsychWiki for psychology[20] and WikiDoc for medicine[21] have existed for several years – but the range and complexity of chemical data present particular challenges. It is hoped that the software and content developed during this project will prove useful not only to users of the site itself but also to other sites dealing with chemical information: Wikipedia, certainly, but also other chemistry websites and private wikis.

Resources needed

A simple, appropriate domain name, wikichem.org, is already owned by the PI and will be used for this website.

A site of this complexity cannot be built as a hobby project – it will require paid employees. There will be two full-time employees and one part-time employee:

Site administrator

This person would be the figurehead of the organization, as well as guiding the overall direction and development of the site. The job description would include:

  • Working with the PI and the technical developer to organize the efficient operation of the site
  • Representing the organization at meetings, conferences and press briefings
  • Organizing the workflow of the staff in an efficient manner, and holding regular meetings
  • Building connections and promoting the site within the chemical community
  • Locating external data and resources that could be added to the site
  • Defining the details of the layout of the site, in consultation with the technical developer, the PI, consultants, collaborators and advisers
  • Building some of the initial site organization and infrastructure, and seeding this with core content
  • Communicating information about the site through a regular blog
  • Coordinating the user community and ensuring proper copyright compliance
  • Maintaining current knowledge of developments in and around chemistry, and of the changing face of the internet

Technical developer

This person would work to add valuable technical features onto the site, organize the servers and handle bugs in the software. The job description would include:

  • Leading the project developments and carrying out tasks by identifying requirements and evaluating technical options.
  • Ensuring that the application achieves agreed performance and availability
  • Advising on the technical requirements to deliver new technologies, products and services.
  • Promoting the technical solution associated with this project by attending and presenting at relevant meetings.
  • Participating in general software design and development including database development; setting milestones and targets.
  • Remaining up to date with state-of-the-art practices, technologies, algorithms and coding standards.
  • Managing the technical aspects of the project, using project management tools and regular project meetings.

Support staff

This person would perform routine office functions, including:

  • Preparing letters and electronic mail, and handling phone calls and routine mail
  • Arranging meetings and taking minutes
  • Receiving visitors
  • Making travel arrangements and other bookings, ordering supplies.

Role of the Principal Investigator

The PI will work with the employees to guide the growth and direction of the site. Responsibilities will include:

  • Developing a broad vision for how the site should develop, in collaboration with the permanent staff and collaborators.
  • Assisting in developing the site organization and infrastructure
  • Seeding the site with core content
  • Supporting the work of the site administrator - building contacts, promoting the site, etc.
  • Contributing to the site blog

Additional support

  • Support during the development phase will be provided through summer work by students (including undergraduates) and faculty. This work would provide information to "seed" the site and build the site infrastructure.
  • Guidance will be provided on a day-to-day basis by the PI, and quite regularly by consultants and contractors. Occasional specialized advice will be provided by the advisory board.
  • Professional legal advice will also be needed periodically.

Other facilities

An office will be provided on the college campus for the employees to perform their work, along with the necessary furniture and computers.

Qualifications of the investigator and the grantee organization

The Principal Investigator has worked professionally as a chemist for 28 years, both in industry and academia. His Ph.D. in synthetic organic chemistry and his background in fine chemical processing provide him with a solid foundation in mainstream chemistry. The PI also worked in chemical information, and his Ph.D. adviser (James B. Hendrickson) helped pioneer the use of computers in chemistry. For almost five years, the PI has contributed to Wikipedia, especially in the chemistry area. He currently coordinates the Wikipedia 1.0 project, which produces offline releases and organizes article assessment.[22]

For a project of this sort, collaboration and network-building are essential attributes. The PI recently negotiated an agreement between Chemical Abstracts Service and the Wikipedia Chemicals WikiProject whereby CAS provides Registry Number information in exchange for Wikipedia links, breaking with a long tradition of keeping such information "closed". This in turn led to the establishment of a free CAS website, providing basic chemical information for the general public.[23] The PI's internal work on Wikipedia, such as setting up and overseeing a system for article assessment (now used on over 1.8 million articles from over 1000 projects), has required a high degree of flexibility and leadership, as will be needed for the proposed Wikichem site.

But for a project with such broad scope, a diverse group of co-PIs, senior personnel and advisers is needed. Each of these people brings a unique skill to the group. The group is as follows:

Co-PIs/Senior personnel
  • Jean-Claude Bradley, organic chemistry professor at Drexel University, blogger and pioneer of Open Notebook Science.
  • Elizabeth Brown, Scholarly Communications and Library Grants Officer, Binghamton University Libraries.
  • Andrew Lang, mathematics professor at Oral Roberts University in Tulsa, OK, also involved in Open Notebook Science work.
  • Nigel Wheatley, PhD chemist, former university lecturer and school teacher, now a science writer and consultant; experienced Wikipedia contributor.
  • Antony Williams, founder of ChemSpider, now VP Strategic Development ChemSpider, Royal Society of Chemistry.
Advisory group
  • Dirk Beetstra, Postdoctoral fellow in chemical technology at Eindhoven University of Technology; experienced Wikipedia contributor and the developer of automatic content correction tools for chemistry articles on Wikipedia.
  • Daniel Mayer, a very experienced Wikipedian with a background in biology and geology.
  • Harry Pence, Distinguished Teaching Professor in chemistry at SUNY Oneonta, expert in chemical education and longtime advocate of technology in the lecture hall.
  • John Proetta, chemistry undergraduate, SUNY Potsdam.
  • Alexander Tropsha, K.H. Lee Distinguished Professor and Chair of the School of Pharmacy at UNC-Chapel Hill, working in bioinformatics, cheminformatics, and computational drug discovery.
  • Bethany Usher, professor of biological anthropology at SUNY Potsdam, and director of the campus Center for Undergraduate Research.
  • William Wedemeyer, Assistant Professor, Dept. of Biochemistry & Molecular Biology, Michigan State University.
  • Andrew Yeung, Research Assistant, Department of Chemistry, National University of Singapore; experienced Wikipedia contributor.

The grantee organization, SUNY Potsdam, is a four-year college which was the first teacher education college in New York. Within the State University of New York system, it now offers a full range of subjects, including chemistry, biochemistry, geology, physics, environmental studies and science education. It provides an excellent environment in which the project can flourish; the chair of the chemistry department and the head of technology services have both offered their enthusiastic support (see Appendix).

Proposed activity and effect on infrastructure

The proposed activity has been divided into three sections:

  • technical activity, which requires a high degree of programming skill and familiarity with the various software environments;
  • technical-content activity, which requires a low-to-moderate degree of programming skill (or at least a collaboration with someone with some programming skills) or which requires specific expertise in the functioning of wikis;
  • content activity, which requires a good general knowledge of chemistry and wiki markup, but little or no programming or other technical expertise.

As with any novel project, it is impossible to give an exact description of the activity that will be necessary or desirable. Just as the activity in an experimental project depends intimately on the experimental results obtained, the activity in developing the proposed site will necessarily depend on feedback from its users. The activity described here is intended to achieve the objectives described above.

Technical activity

Set up site, including choice and installation of appropriate pre-existing MediaWiki extensions

An ad hoc test site has already been set up at SUNY Potsdam. However, it is expected that the installation will have to be reviewed in the light of the objectives of this proposal to ensure that it takes full advantage of the software tools which are already available for MediaWiki sites.

Extensions

A Jmol extension for MediaWiki wikis already exists, to allow display of .mol files.[24] Hence this activity would first involve installing and testing this extension, and then modifying it if enhanced functionality and/or usability are deemed necessary.

Structure searching would initially be limited to exact structure matches. However, it is hoped to extend the structure searching options to include searching by molecular fragments and reaction searching, and it is expected that development of searching options will occupy a large proportion of the technical activity of the project. There are some structure searching tools in the Chemistry Development Kit (CDK),[25] written in Java and released under the Lesser General Public License (LGPL),[26] and the free molecule editor BKChem,[27] written in Python, also provides a basis from which to create new free software molecular structure searching tools.
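
As an illustration of the exact-match case only, one possible approach is to reduce each structure to a canonical identifier and look it up in an index. The Python sketch below assumes that structures have already been converted to InChI strings by an external tool; the index contents, page titles and function name are hypothetical and are not part of any existing extension. Fragment and reaction searching would need a full cheminformatics toolkit and are not covered here.

```python
from typing import Dict, Optional

# Hypothetical index: page titles keyed by Standard InChI string.
# On a real site this would be built from the stored structure files.
STRUCTURE_INDEX: Dict[str, str] = {
    "InChI=1S/H2O/h1H2": "Water",
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": "Ethanol",
    "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H": "Benzene",
}


def find_exact(inchi: str, index: Dict[str, str] = STRUCTURE_INDEX) -> Optional[str]:
    """Return the page title for an exact structure match, or None."""
    # Surrounding whitespace is stripped; the identifier itself is compared as-is.
    return index.get(inchi.strip())
```

Because the lookup is a plain dictionary access, a query either matches a stored identifier exactly or returns nothing, which mirrors the "exact structure matches only" starting point described above.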

Chemical markup language (CML) is a dialect of XML allowing the transfer of chemical information between different software applications.[28] An extension is not needed to store CML files (which is trivial), but rather to integrate them with the rest of the content on the site. The data stored in the CML file needs to be accessible from other areas of the site and the output needs to be readable by humans for this integration to be deemed successful.
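
To make the idea of human-readable output from a CML file concrete, the Python sketch below reads a small hand-written CML fragment and produces a one-line summary. It is a minimal illustration under stated assumptions, not a proposed design for the extension: the sample molecule, the function name and the summary format are all invented for this example, and only the basic molecule, atom and bond elements of CML are handled.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# A small hand-written CML fragment for water (illustrative only).
WATER_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""


def summarize_cml(cml_text: str) -> str:
    """Return a short human-readable summary of a single CML molecule."""
    root = ET.fromstring(cml_text)
    atoms = root.findall(".//cml:atom", CML_NS)
    bonds = root.findall(".//cml:bond", CML_NS)
    counts = Counter(atom.get("elementType") for atom in atoms)
    # Elements are listed alphabetically; a count of 1 is left implicit.
    formula = "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(counts.items()))
    return f"{root.get('id')}: {formula}, {len(bonds)} bonds"
```

A summary of this kind could be shown alongside the stored file, so that the data in the file is visible to readers who never open the CML itself.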

The main challenge of integrating Open Notebook Science into the site is that a laboratory notebook is more like a blog – a chronological series of entries which occasionally might generate comments from outsiders – than a wiki. However blogs and wikis are both just databases of content coupled with a parser to produce HTML output for publication on the Web. An extension that provides efficient crosslinking between the two datasets would create a powerful tool for researchers that could also be used on other sites (including private sites for data that cannot be immediately published).

Preparation of usage statistics and regular database dumps

This point may seem trivial, but usage statistics are essential to gauge the areas of the site which attract the most interest among users, while regular database dumps (once every three months as an absolute minimum) are necessary to ensure data integrity. Database dumps will be published as compressed files for reuse, and copies will be stored away from Potsdam, New York (the location of the main servers) to insure against catastrophic server failure.
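
A minimal Python sketch of the compression step is given below, together with a checksum that could be used to confirm that an off-site copy matches the original. The function names and file names are hypothetical, and producing the dump itself is assumed to be handled by the existing MediaWiki tools.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def compress_dump(dump_path: Path) -> Path:
    """Write a gzip-compressed copy next to the dump and return its path."""
    compressed = dump_path.with_name(dump_path.name + ".gz")
    with open(dump_path, "rb") as src, gzip.open(compressed, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return compressed


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Publishing the digest alongside each compressed file would let anyone holding a copy check that it has not been corrupted in transfer or storage.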

Technical-content activity

Choice of supported filetypes and MediaWiki "namespaces"

The choice of filetypes determines the types of data which can be easily stored on the site: as such, it is intended to offer as wide a range as possible, while being aware that certain proprietary filetypes are "insecure" in that they can harbor malicious code. The MediaWiki namespaces allow a top level of categorization of content: there should be sufficient distinct namespaces to allow content to be segregated by type, but not so many as to confuse users.

Ensuring correct procedures for the labelling of contributed content with its correct copyright license

If content is to be reused, it must be clear under which terms that reuse is possible. It is inevitable that the content hosted on the site will come under a range of different copyright licenses, not least as there are differences of opinion within the Open Science community as to which is the most appropriate copyright license to use. As such, each page of the site will have to be licensed separately, and the technical means of ensuring this put in place.

Development of automated and semi-automated methods for the validation and correction of certain content

Some initial work has already been carried out at Wikipedia on the automated correction of numerical data in chemistry articles,[29] and it is hoped to expand that work on this site.
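
One simple kind of automated check is to compare a stated value against one that can be computed from other data on the same page. The Python sketch below compares a stated molar mass with the value calculated from a molecular formula. It is a hypothetical illustration only and is not a description of how the existing Wikipedia bot works; the function names, the tolerance and the short table of rounded atomic masses are assumptions made for this example.

```python
import re

# Rounded standard atomic masses for a few elements (g/mol).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molar_mass(formula: str) -> float:
    """Molar mass of a simple formula such as 'C2H6O' (no brackets or charges).

    Raises KeyError for elements missing from ATOMIC_MASS.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def check_molar_mass(formula: str, stated: float, tolerance: float = 0.05) -> bool:
    """True if the stated value agrees with the computed one within tolerance."""
    return abs(molar_mass(formula) - stated) <= tolerance
```

A check of this sort would flag, for example, a molar mass entered with transposed digits, leaving the decision on how to correct it to a human editor or a later automated step.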

Development of methods to ensure the machine-readability of content at the minimum effort for the contributor

Machine-readability is essential for automated correction, but it is also useful for data mining and other forms of reuse. However, machine-readability must not be too onerous for the contributors of content, otherwise no content will be contributed. The correct trade-off between site structure and ease of contribution is essential for the success of the site.

Content activity

Development of a robust categorization system

In MediaWiki, "Categories" are one of the main methods of organizing content (along with namespaces). Although categories can be relatively easily modified to accommodate new material, it is important to have an initial structure in place to avoid pages "getting lost" as they are added.

Creation of portals to enable access to initial material

Portals are important to new or casual users to provide an overview of the content which is available in a particular subject area. As with categories, these are "living" documents that can be easily modified but it is important to have an initial "seed structure" in place to solicit further contributions.

Selection of initial material, including creation of material where necessary

An empty wiki is like a blank piece of paper: there are lots of things which can be done with it, but most people wouldn't find it very interesting. The site will need some initial content to attract its first users and to exemplify the possibilities for expansion. Much of this content can be adapted from Wikipedia articles, but some will have to be written de novo.

Development of a robust set of policies and guidelines

The only effective community policies are those which are developed by the community itself, but there needs to be an initial set of rules covering areas such as the scope of the site, copyright etc. It is important that these rules are as short and simple as possible: it is a recurring criticism of Wikipedia that its rules have become so involved and complex that they are a deterrent to potential contributors.

Identification of online chemistry resources

The provision of links to other online chemistry resources is a (relatively) quick and easy way of creating useful "seed content".

Policing of vandalism and copyright violations

It would be naive to expect that an open site would never experience vandalism. Fortunately, such idiotic actions are only a small proportion of changes to most public wikis, and tend to be proportional to the overall site traffic. If and when the community of users expands, such policing will become a community function but, at the beginning, it will probably have to be performed by the core developers. Copyright violations usually arise from misunderstanding rather than malicious intent, but again they must be removed as quickly as possible and this task will initially fall to the core developers.

Broad impact

The broad applicability of this project is clear in the very goals of the site - we aim to attract a wide variety of content from a wide variety of people, and create a portal that is valuable across a very wide spectrum of natural sciences.

a. Range of content

This site aims to provide a service to all areas of chemistry; we hope that subprojects of people from different areas may evolve to organize content in those areas, as happens with "WikiProjects" on Wikipedia. We envisage providing - through on-site content or mashups - the complete range of chemical information, such as thermodynamic data, preparative procedures for organic chemists, nanoparticle toxicity data, inorganic chemistry reviews, etc. A significant amount of traffic is expected to come from non-chemists who use chemical techniques, or from those working in interdisciplinary areas. Internal links can help explain highly specialized technical terms, making a wiki of this sort more appealing to non-chemists.

Some of our advisory group were asked to provide comments on this:

"Chemistry is at the heart of many disciplines, such as biochemistry, materials science and polymer science. The proposed nexus site would be useful to these other fields as well, amplifying the benefits for the initial investment. For illustration, undergraduate students learning biochemistry need to understand many basic chemical concepts and nomenclature, such as pKa, chirality, and relative reactivities of chemical groups. Researchers in biochemistry could also be helped by knowing the latest chemical techniques and the methods for carrying them out. Furthermore, the chemistry nexus website could be the seed from which an integrated network of such sites for related disciplines could grow."

William Wedemeyer, Assistant Professor, Dept. of Biochemistry & Molecular Biology, Michigan State University.

"Anthropologists and archaeologists rely on chemistry to help understand human diversity and history. For instance, chemical testing is done to determine archaeological site use, or diet composition. Anthropologists often refer to professional literature that includes chemical jargon that may not be readily understandable to them. An online database that would allow anthropologists to understand what tests are available to help answer their research questions, and that would give them reference material to understand research literature, would be exceptionally helpful, and could encourage more collaboration between the fields."

Bethany Usher, Associate Professor, Dept. of Anthropology, SUNY Potsdam

"I am very excited about your proposal under the NSF’s STCI Program to create a wiki-based hub/nexus for chemistry. As a computational medicinal chemist I work at the interface between chemistry and biology (and between chemists and biologists!). This is a very challenging position because generally speaking both groups lack necessary complementary knowledge due to their respective training. Ultimately, both groups would like to have a drug (most often, an organic molecule) emerging from their efforts, and in this regard an access to diverse chemical information that would become available in your proposed portal would be a critical resource. The unique aspect of your proposed portal is its interactive nature with respect to its users enabling the entire community of chemists to share their knowledge and learn from the experience of others."

Alexander Tropsha, K.H. Lee Distinguished Professor and Chair of the School of Pharmacy at UNC-Chapel Hill.

One of our senior personnel provides a useful perspective - how librarians might use the site:

(To be added)

Elizabeth Brown, Scholarly Communications and Library Grants Officer, Binghamton University Libraries

There may even be a cross-fertilization of ideas across different disciplines. A 2008 study by the Association of Research Libraries noted: "we heard anecdotal evidence that models are indeed jumping the disciplinary divide as scholars observe new models that work and adapt them to suit their own discipline."[30]

b. Different levels of knowledge and experience

Many chemistry resources are geared towards chemistry professionals, but these can be quite challenging for students and non-chemists to comprehend. Our proposed site aims to provide the information that professionals need, but this can be provided along with content written at a more basic level. A wiki is excellent at providing the context for information - such as linking to explanations of technical terms - without interrupting the flow of text for professional users. The lack of page limits means that content can be written at multiple levels where needed. The site is expected to prove popular among high school students and undergraduates. Our student adviser provided us with his viewpoint:

(To be added)

John Proetta, chemistry major/senior undergraduate, SUNY Potsdam.

c. Breadth of applications

The site will provide content aimed specifically at a range of groups, such as educators, students and chemistry professionals, both industrial and academic. Yet by bringing this variety of information together on one site, all of these groups have access to the full range of information resources. For example, an industrial chemical engineer may end up using the same information as a high school teacher, for very different purposes.

The site can also contribute significantly to the infrastructure needed for cyberlearning, defined as "the use of networked computing and communications technologies to support learning." The vision for the Wikichem project is in close alignment with the five recommendations of the Task Force on Cyberlearning, as described in their 2008 report.[31]

Why an STCI grant?

This document clearly describes a proposal for cyberinfrastructure, which is what the STCI program was created to support: "The primary purpose of the Strategic Technologies for Cyberinfrastructure Program (STCI) is to support work leading to the development and/or demonstration of innovative cyberinfrastructure services for science and engineering research."

Many current chemistry websites are small for a reason; with limited resources, they can only offer a "niche" service. If the site is to deliver good content, the scope must be narrow. A good example is Synthetic Pages, which does a very nice job of providing experimental procedures for organic chemists. However, many organic chemists may be unaware of the site, unless they stumble upon it. What is needed is something broader, but using science terms, to bring together such information in a systematic way and make it accessible - a nexus to connect sites together, a backbone of the new cyberinfrastructure.

Other related NSF programs appear to be less appropriate; the CDI program aims to "create revolutionary science and engineering research outcomes made possible by innovations and advances in computational thinking." Our proposal may lead to one or two innovations of this sort, but the core of the proposal is simply to provide existing types of information in a new way - not to generate new types of information. The project will also require some new code, but again, that is not the main purpose, making the SHF program also unsuitable.

In summary, we consider the STCI program to be perfectly aligned with our project goals.

Amount of funding required

A significant amount of support is needed to establish a major information hub of this sort. Although some websites can start very small and grow, these usually succeed by focusing on a particular niche; it is hard for such sites to have the broad scope needed to bring together diverse forms of information from all areas of chemistry. In order to succeed, our proposal requires permanent employees with a high level of expertise, working for a sustained three-year period. Salaries for these staff make up the majority of the budget. Only in this way can the site have the professional support and direction needed to succeed.

Longevity

Three years should provide sufficient time to judge the success of this project; by the end of that time, we should know whether it has succeeded or not. However, the aim of this project is to establish a website that will endure well beyond the three-year period of the funding requested here. We have considered the long-term impact of the project based on three possible outcomes:

Low impact

If, for any reason, the site fails to establish itself as a major internet hub within three years, funding and support for the site may cease. However, the legacy of the site (other than basic lessons learned) could still be significant, in terms of open source chemistry applications and scripts written for use within the wiki environment. Also, much of the user-generated content may be unique, and we would work with other websites to preserve that content; it is important that such content not be lost.[32]

Medium impact, and growing

It is possible that the site establishes itself as a valuable resource after three years, but it has not reached the point where it can support itself. In this case, we may apply for renewal of grant support from NSF, as well as seeking funding elsewhere.

High impact, and self-supporting

If the site "takes off" as intended, it should become important enough that outside organizations see value in supporting the work. This has already been seen with ChemSpider, recently acquired by the Royal Society of Chemistry.[33] In this scenario, further NSF grant support is unlikely to be needed. The site may end up being supported by a professional body such as ACS or IUPAC, or alternatively operate as a non-profit organization like Wikipedia or the InChI Trust.[34]

References

  1. Kent German, CNET, "Top 10 dot-com flops", accessed August 11, 2009
  2. Wikis, blogs and podcasts: a new generation of Web-based tools for virtual collaborative clinical practice and education. Kamel Boulos, M. N.; Maramba, I.; Wheeler, S. BMC Medical Education 2006, 6, 41. doi:10.1186/1472-6920-6-41
  3. The Web 2.0 way of learning with technologies. Rollett, H., Lux, M., Strohmaier, M., Dösinger, G. and Tochtermann, K., Int. J. Learning Technology, 2007, 3(1), 87–107.
  4. What Wikipedia is not, Wikimedia Foundation, accessed August 13, 2009.
  5. http://en.wikipedia.org/
  6. Source: Wikia, Inc.: http://www.wikia.com/wiki/About_Wikia
  7. Eg, http://developer.novell.com/wiki/index.php/Developer_Home
  8. Personal communication from Colin Batchelor, Royal Society of Chemistry.
  9. http://www.mediawiki.org/wiki/MediaWiki . MediaWiki was developed by Magnus Manske, Brion Vibber, Lee Daniel Crocker, Tim Starling, Erik Möller, Gabriel Wicke, Ævar Arnfjörð Bjarmason, Niklas Laxström, Domas Mituzas, Rob Church, Yuri Astrakhan, Aryeh Gregor, Aaron Schulz and others: for a full list, see http://www.mediawiki.org/wiki/Special:Code/MediaWiki/author
  10. Because of the unknown number of private wikis, it is impossible at present to determine the most popular wiki software in terms of number of sites using the software.
  11. Free Software Foundation (June 1991), “GNU General Public License version 2”: http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt
  12. Free Software Foundation (November 2002), “GNU Free Documentation License version 1.2”: http://www.gnu.org/licenses/old-licenses/fdl-1.2.txt
  13. Creative Commons, “Attribution-Share Alike 3.0 Unported”: http://creativecommons.org/licenses/by-sa/3.0/
  14. Source: MediaWiki. http://www.mediawiki.org/wiki/MediaWiki
  15. MediaWiki, “Extensions”: http://www.mediawiki.org/wiki/Manual:Extensions
  16. Cite.php (latest version 1.11) was written by Ævar Arnfjörð Bjarmason: http://www.mediawiki.org/wiki/Extension:Cite/Cite.php
  17. Free Software Foundation (June 2007), “GNU General Public License version 3”: http://www.gnu.org/licenses/gpl-3.0.txt
  18. Free Software Foundation (November 2008), “GNU Free Documentation License version 1.3”: http://www.gnu.org/licenses/fdl-1.3-standalone.html
  19. Wikipedia, “Make technical articles accessible”: http://en.wikipedia.org/wiki/Wikipedia:Make_technical_articles_accessible
  20. http://www.psychwiki.com/
  21. http://www.wikidoc.org/
  22. Wikipedia:Version 1.0 Editorial Team, Wikimedia Foundation. Accessed August 13, 2009.
  23. CAS Launches Free Web-Based Resource "Common Chemistry" for General Public, CAS News Release, May 12, 2009.
  24. http://wiki.jmol.org/index.php/MediaWiki
  25. REF NEEDED Chemistry Development Kit
  26. REF NEEDED Lesser General Public License
  27. REF NEEDED BKChem
  28. REF NEEDED Chemical Markup Language
  29. The vast majority of this work has been carried out by Dirk Beetstra: see, e.g., http://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/CheMoBot_2
  30. Current Models of Digital Scholarly Communication: Results of an Investigation Conducted by Ithaka for the Association of Research Libraries, November 2008, Association of Research Libraries.
  31. Fostering Learning in the Networked World: The Cyberlearning Opportunity and Challenge, A 21st Century Agenda from the National Science Foundation, Report of the NSF Task Force on Cyberlearning, June 24, 2008, National Science Foundation.
  32. Keeping the records of science accessible: can we afford it?. Report on the 2008 Annual Conference of the Alliance for Permanent Access, Budapest, November 4, 2008, Alliance for Permanent Access.
  33. RSC acquires ChemSpider, RSC Press Release, 11 May 2009.
  34. Launch of the InChI Trust, Nature Publishing Group Press Release, 21 July 2009.