Free statistical software: Difference between revisions
{{subpages}}
{{TOC|right}}
'''Free statistical software''' is a practical alternative to commercial packages. In general, free statistical software gives the same results as commercial programs, and many of the packages are fairly easy to learn because they use menu systems, although a few are command-driven. These packages come from a variety of sources, including [[government]]s, [[nongovernmental organization]]s (NGOs) such as [[UNESCO]], and [[University|universities]], and some are developed by individuals.
Some packages are developed for specific purposes (e.g., time series analysis, factor analysis, or calculators for probability distributions), while others are general packages with a variety of statistical procedures. This article reviews the general statistical packages.
==Brief history of free statistical software==
Some of the free software packages come from governmental organizations or NGOs, such as Epi Info, from the CDC<ref name=epiinfo>Epi Info, CDC, 2008 http://www.cdc.gov/epiinfo/index.htm.</ref>, and IDAMS, from UNESCO<ref name=idams>IDAMS Statistical Software, http://portal.unesco.org/ci/en/ev.php-URL_ID=2070&URL_DO=DO_TOPIC&URL_SECTION=201.html</ref>. Others come from smaller or independent organizations or universities, such as Instat<ref name=instat>Instat - an interactive statistical package, Statistical Services Centre - University of Reading, 2009. http://www.ssc.rdg.ac.uk/software/instat/instat.html</ref> and Irristat<ref name=irristat>Irristat, International Rice Research Institute, Biometrics and Bioinformatics Unit, http://www.irri.org/science/software/irristat.asp</ref>. Another package, the [[R (programming language)|R Project]]<ref name=r>The R Project, http://cran.r-project.org/</ref>, is developed by a group of volunteers. A large proportion of free statistical software packages, however, come from individuals. These include EasyReg<ref name=easyreg>Easy Reg International, Herman Bierens, Penn State University, 2008 http://econ.la.psu.edu/~hbierens/EASYREG.HTM</ref>, MicrOsiris<ref name=osiris>MicrOsiris, Neal Van Eck, Van Eck Computer Consulting http://www.microsiris.com/</ref>, OpenStat<ref name=openstat>OpenStat, Bill Miller, 2009 http://www.statpages.org/miller/openstat/</ref>, PSPP<ref name=pspp>PSPP, 2008 http://www.gnu.org/software/pspp/</ref>, SOFA<ref>Dr Grant Paton-Simpson, SOFA - Statistics Open For All, http://www.sofastatistics.com/home.php</ref> and Zelig<ref name=zelig>Imai, Kosuke, Gary King and Olivia Lau. 2006. “Zelig: Everyone’s Statistical Software,” http://GKing.Harvard.Edu/zelig.</ref>.
At least one package, WinIDAMS, was developed to make key technologies available to those who could not otherwise afford them, and so to empower development<ref>UNESCO. 03-11-2004. In Focus: Communication and Information Sector's In Focus service. UNESCO and Software. http://portal.unesco.org/ci/en/ev.php-URL_ID=17447&URL_DO=DO_TOPIC&URL_SECTION=201.html</ref>. OpenStat and Instat were developed as teaching aids<ref name=openstat/><sup>,</sup><ref name=instat/>. Other packages were developed for specific purposes but can be used more generally. Examples are Irristat<ref name=irristat/>, developed for agricultural analysis, and Epi Info<ref name=epiinfo/>, developed for public health. Several of the packages (PSPP, R and Osiris) do not appear to state why they were developed, other than for general statistical analysis.
These free software packages have been used in a number of scholarly publications. For example, OpenStat was used in a research letter to JAMA<ref>Future Salary and US Residency Fill Rate Revisited, Mark Ebell. Research letter in JAMA, September 10, 2008, Vol 300, No. 10, p1131-1132. http://jama.ama-assn.org/cgi/reprint/300/10/1131</ref> and in several published studies<ref>Differential gene expression patterns in cyclooxygenase-1 and cyclooxygenase-2 deficient mouse brain. Christopher D Toscano, Vinaykumar V Prabhu, Robert Langenbach, Kevin G Becker, and Francesca Bosetti. Genome Biol. 2007; 8(1): R14. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1839133 <br/><br/>M Bielaszewska, B Sinha, T Kuczius and H Karch. Cytolethal Distending Toxin from Shiga Toxin-Producing Escherichia coli O157 Causes Irreversible G2/M Arrest, Inhibition of Proliferation, and Death of Human Endothelial Cells. Infection and Immunity, January 2005, p. 552-562, Vol. 73, No. 1. http://iai.asm.org/cgi/content/full/73/1/552<br/><br/>C.D. Toscano, P.J. Kingsley, L.J. Marnett, and F. Bosetti. NMDA-induced Seizure Intensity is Enhanced in COX-2 Deficient Mice. Neurotoxicology. 2008 November; 29(6): 1114–1120. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2587528<br/><br/></ref>. Irristat was used in an agricultural report<ref>FAO Plant Production and Protection Paper No. 174, Rome, 2003, Genotype x environment interactions. Challenges and opportunities for plant breeding and cultivar recommendations, http://www.fao.org/DOCREP/005/Y4391E/y4391e00.htm </ref>, EasyReg is listed or used in several papers<ref>A Gambardella and Bronwyn H. Hall, "Proprietary versus public domain licensing of software and research products" (2006). Research Policy. 35 (6), pp. 875-892. Postprint available free at: http://repositories.cdlib.org/postprints/1865.<br/><br/>Liu, Wen-Chi and Tsangyao Chang, (2008) "Rational Bubbles in the Korea Stock Market? Further Evidence based on Nonlinear and Nonparametric Cointegration Tests." Economics Bulletin, Vol. 3, No. 34 pp. 1-12. http://economicsbulletin.vanderbilt.edu/2008/volume3/EB-08C30021A.pdf<br/><br/>Harumi Ito and Darin Lee, Journal of Economics and Business, Volume 57, Issue 1, January-February 2005, Pages 75-95. Assessing the impact of the September 11 terrorist attacks on U.S. airline demand. http://dx.doi.org/10.1016/j.jeconbus.2004.06.003. Also available here http://www.brown.edu/Departments/Economics/Papers/Papers/2003/2003-16_paper.pdf<br/><br/></ref>, Epi Info was used in several papers<ref>Rahav G, Gabbay R, Ornoy A, Shechtman S, Arnon J, Diav-Citrini O. Primary versus nonprimary cytomegalovirus infection during pregnancy, Israel. Emerg Infect Dis [serial on the Internet]. 2007 Nov [May 15, 2009]. Available from http://www.cdc.gov/EID/content/13/11/1791.htm<br/><br/>Chan P-C, Huang L-M, Wu Y-C, Yang H-L, Chang I-S, Lu C-Y, et al. Tuberculosis in children and adolescents, Taiwan, 1996–2003. Emerg Infect Dis [serial on the Internet]. 2007 Sep. Available from http://www.cdc.gov/EID/content/13/9/1361.htm<br/><br/>ME Gyasi, WMK Amoaku, and MA Adjuik. Epidemiology of Hospitalized Ocular Injuries in the Upper East Region of Ghana. Ghana Med J. 2007 December; 41(4): 171–175. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2350113<br/><br/></ref>, R was used in several papers<ref>Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris. statnet: Software Tools for the Representation, Visualization, Analysis and Simulation of Network Data. J Stat Softw. 2008; 24(1): 1548–7660. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2447931<br/><br/>Michael E. Hume, Charles M. Scanlan, Roger B. Harvey, Kathleen Andrews, James D. Snodgrass, Armen G. Nalian, Alexandra Martynova-Van Kley, and David J. Nisbet. Denaturing Gradient Gel Electrophoresis as a Tool To Determine Batch Similarity of Probiotic Cultures of Porcine Cecal Bacteria. Applied and Environmental Microbiology, August 2008, p. 5241-5243, Vol. 74, No. 16. http://aem.asm.org/cgi/content/abstract/74/16/5241<br/><br/>Max Bylesjö, Jeremy K Nicholson, Elaine Holmes and Johan Trygg. BMC Bioinformatics 2008, 9:106. http://www.biomedcentral.com/1471-2105/9/106<br/><br/></ref> and WinIDAMS was used in several papers<ref>N. S. Sapre, N. Pancholi, and S. Gupta, Computational Modeling of Substitution Effect on HIV–1 Non–Nucleoside Reverse Transcriptase Inhibitors with Kier–Hall Electrotopological State (E–state) Indices, Internet Electron. J. Mol. Des. 2008, 7, 55–67, http://www.biochempress.com/cv07_i03.html <br/><br/>Chawla, Anju. Exploring project selection behavior of academic scientists in India. Research Evaluation, Volume 16, Number 1, March 2007, pp. 35-45(11). http://www.ingentaconnect.com/content/beech/rev/2007/00000016/00000001/art00004<br/><br/></ref>.
While MicrOsiris does not appear to have been used in academic research, the author of the program was one of the original authors of OSIRIS<ref>Data Sharing for Demographic Research Knowledge Base, question on OSIRIS, University of Michigan, http://dsdr-kb.psc.isr.umich.edu/answer.html?i=1076</ref>, the program from which WinIDAMS was developed<ref name=micwin>IDAMS, Internationally Developed Data Analysis and Management Software Package. WinIDAMS Reference Manual (release 1.3) UNESCO, 2008. Preface. http://portal.unesco.org/ci/en/ev.php-URL_ID=25081&URL_DO=DO_TOPIC&URL_SECTION=-465.html</ref>. The author of MicrOsiris has also contributed or co-contributed several components to WinIDAMS<ref name=micwin/>.
==Reviews of free statistical software==
There are a few reviews of free statistical software. Two reviews appeared in journals (but were not peer reviewed): one by Zhu and Kuljaca<ref>"A Short Preview of Free Statistical Software Packages for Teaching Statistics to Industrial Technology Majors" Journal of Industrial Technology (Volume 21-2, April 2005), Ms. Xiaoping Zhu and Dr. Ognjen Kuljaca. http://www.nait.org/jit/current.html</ref> and an article by Grant that consisted mainly of a brief review of the [[R (programming language)|R Project]]<ref>Felix Grant, "Free Statistics Software, Yours, Free to keep....", Scientific Computing World, Sept/Oct 2004, http://www.scientific-computing.com/scwsepoct04free_statistics.html </ref>. Zhu and Kuljaca outlined some useful characteristics of software, such as ease of use, the number of statistical procedures and the ability to develop new procedures. They reviewed several programs and identified which ones, at that time, had the most functionality; several of the programs may not then have had all of the desired capabilities for advanced statistics. Grant reviewed some of the programming features of R and briefly mentioned the availability of other programs. One other paper reviewed statistical packages, mainly commercial ones, but included R<ref>Edward J. Wegman and Jeffrey L. Solka. 2005. Statistical Software for Today and Tomorrow. http://www.galaxy.gmu.edu/ (listed as "A Guide to Statistical Software").</ref>. One article reviewed EasyReg and included a discussion of its accuracy<ref>Hwan-sik Choi and Nicholas M. Kiefer, Software evaluation: EasyReg International. International Journal of Forecasting. Volume 21, Issue 3, July-September 2005, Pages 609-616. http://dx.doi.org/10.1016/j.ijforecast.2005.02.003 </ref>.
Only one review has compared the output of various packages<ref name=shackman>Shackman, Gene. 2006. "Comparing free statistical software for data sets with no missing values" and "Comparing free statistical software, Handling missing data". Both available here "Free Software" http://gsociology.icaap.org/methods/soft.html</ref>. In this review, all of the packages read either CSV (comma-separated values: text files in which values are separated by commas) or Excel files. All of the packages gave exactly the same results for correlation and regression, and the free packages also gave the same regression results as Excel. One of the main differences among the packages was how they handled missing data. With the example data sets used in the review, and for the package versions available in November 2006 when the review was conducted, two packages, MicrOsiris and Epi Info, could read files with blanks for missing values. Two other programs, Stat4U and WinIDAMS, needed a placeholder for missing values, such as -9 or -9.99. The other packages could only handle data sets with no missing values.
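The kind of cross-check described above can be reproduced by hand. The following is a minimal sketch in Python, assuming a small made-up CSV data set (not the data used in the review) and hypothetical helper names; it computes a Pearson correlation and a simple least-squares line, which are the quantities one would compare against a package's output.

```python
import csv
import io
import math

# Hypothetical example data; NOT the data sets used in the review.
CSV_TEXT = """x,y
1,2
2,4
3,5
4,4
5,5
"""

def read_columns(text):
    """Parse CSV text with columns x and y into two lists of floats."""
    rows = list(csv.DictReader(io.StringIO(text)))
    xs = [float(r["x"]) for r in rows]
    ys = [float(r["y"]) for r in rows]
    return xs, ys

def _sums(xs, ys):
    """Return the means and the centered sums of squares and products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return mx, my, sxy, sxx, syy

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my, sxy, sxx, _ = _sums(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx

xs, ys = read_columns(CSV_TEXT)
slope, intercept = fit_line(xs, ys)
print(f"r = {pearson_r(xs, ys):.4f}, slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Running the same data through two packages and comparing the printed values to several decimal places is one simple way to check that they agree.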
Two websites that list software also have very brief reviews of each package: StatCon<ref>List of free statistical software, Open Source & Public Domain Packages with Source Code. StatCon 2006. http://statistiksoftware.com/free_software.html</ref> and Pezzullo's site<ref>Pezzullo, Free Statistical Software, 2009. http://statpages.org/javasta2.html</ref>. These sites mainly offer a brief list of the features available in the packages. Similarly, one other website compares the statistical procedures available in free statistical packages<ref>Andrea Corsini. 2009. Free Statistics. Free statistical software comparisons. http://en.freestatistics.info/comp.php</ref>. In this comparison, R had all of the procedures, OpenStat had 16, MacAnova had 15, and MicrOsiris had 12. The others had from 8 to 11 of the procedures.
There is also a journal specifically for statistical software<ref>Journal of Statistical Software, http://www.jstatsoft.org/</ref>, although its main focus is on commercial software, R and some coding snippets.
In contrast, there are various reviews of commercial statistical software, such as a comparison of several major packages<ref>Acock, Alan C. “SAS, Stata, SPSS: A Comparison”. Journal of Marriage and Family, November 2005, Vol. 67, pp. 1093-1095. Summarized in Hom, Willard. 2006. Choosing Between SAS, Stata, and SPSS. http://www.cccco.edu/SystemOffice/Divisions/TechResearchInfo/ResearchandPlanning/AbstractsofResearch/ResearchMethods/tabid/302/Default.aspx</ref> and a brief review of several packages<ref>Wass, John. No date. Comparative Statistical Software Review. Tabulations and musings from your editor's biased perspective. Scientific Computing. http://www.scientificcomputing.com/comparative-statistical-software.aspx</ref>.
==Using free statistical software==
Before using any statistical package, it is generally a good idea to have a solid background in [[Statistics]]. The package can then be used to best advantage: for example, to choose the most appropriate test and to check that all the necessary assumptions are met, so that appropriate conclusions can be drawn.
Once the statistical issues are understood, the next step is to decide which package to use. Most of these packages are menu driven and can be learned in a couple of hours at most. The exceptions are R, which is generally code driven and takes much longer to learn, and to some extent CDC's Epi Info, which also takes some time to learn.
Several of the packages also have tutorials, which provide a basic introduction to the programs. For example, the CDC has tutorials about Epi Info<ref>Epi Info™ Community Health Assessment Tutorial. The Epi Info™ Community Health Assessment Tutorial was produced by the collaborative efforts of the Centers for Disease Control and Prevention (CDC), the Assessment Initiative (AI), and the New York State Department of Health (NYSDOH). http://www.cdc.gov/epiinfo/communityhealth.htm </ref><sup>,</sup><ref>Cholera Outbreak in Rwenshama: Using Epi Info for Windows in an Outbreak Investigation. Coordinating Office for Global Health - DGPHCD, http://www.cdc.gov/cogh/dgphcd/training/softwaretraining.htm</ref>. The CDC page also lists a video slide show tutorial from the University of Nebraska<ref>Introduction to EPI2000. GPVEC Great Plains Veterinary Educational Center. University of Nebraska - Lincoln. http://gpvec.unl.edu/videos/epi-stats.asp </ref>, and another site has online training classes<ref>The North Carolina Center for Public Health Preparedness Training Website http://nccphp.sph.unc.edu/training/index.html</ref>. R has a large number of tutorials and manuals, in English and other languages<ref>Contributed Documentation. http://cran.r-project.org/other-docs.html.<br/><br/>William Revelle, Using R for psychological research: A simple guide to an elegant package, 2008, http://personality-project.org/r/<br/><br/>Dong-Yun Kim, MAT 356 R Tutorial, Spring 2004. http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html</ref>, and an FAQ site<ref>R FAQ. Frequently Asked Questions on R. Version 2.8.2009-03-18. ISBN 3-900051-08-9 http://lib.stat.cmu.edu/R/CRAN/doc/FAQ/R-FAQ.html</ref>. A few of the packages have email discussion lists, including R<ref>R-help -- Main R Mailing List: Primary help. https://stat.ethz.ch/mailman/listinfo/r-help</ref> and PSPP<ref>Pspp-users -- PSPP user discussion, http://lists.gnu.org/mailman/listinfo/pspp-users</ref>.
Most of the packages have online manuals, guides or help pages. These are useful when there are questions about specific procedures or statistical tests. Manuals or guides are available for R<ref name=Rintro>R Development Core Team. An Introduction to R. Version 2.8.1 (2008-12-22). ISBN 3-900051-12-7. http://cran.r-project.org/doc/manuals/R-intro.html</ref>, EasyReg<ref name=easyregguide>Herman J. Bierens. EasyReg International: Guided tours. No Date Given. http://econ.la.psu.edu/~hbierens/ERITOURS.HTM</ref>, OpenStat<ref name=openstat/>, PSPP<ref name=psppref>Documentation, No Date Given. PSPP. http://www.gnu.org/software/pspp/documentation.html</ref>, ViSta<ref name=vistaguide>Forrest W. Young, 1996. ViSta User's Guide. http://forrest.psych.unc.edu/research/</ref>, WinIDAMS<ref name=winidamsguide>P.S. Nagpaul. 1999. Guide to Advanced Data Analysis using IDAMS Software. http://www.unesco.org/webworld/idams/advguide/TOC.htm</ref><sup>,</sup><ref name=winidamsguide2>Unesco. 2008. WinIDAMS 1.3 Reference Manual - Table of Contents. http://www.unesco.org/webworld/portal/idams/html/english/TOC.htm</ref>, MicrOsiris<ref name=micromanual>Van Eck, Richard, Microsiris, Statistical and Data Management Software System. Version 9.1, 2006. Van Eck Computer Consulting. http://www.microsiris.com/MicrOsiris.htm</ref> and Zelig<ref name=zelig/>. The CDC Epi Info site itself does not have a manual, but a faculty member at Emory's School of Public Health has written an introductory manual<ref name=epiinfoguide>Kevin M. Sullivan. Mar 3 2008. Introduction to Epi Info (Version 3.4.1) Analyze Data Module. http://www.sph.emory.edu/~cdckms/</ref>.
Finally, there are a number of commercial packages, such as SAS<ref>http://www.sas.com/</ref>, SPSS<ref>http://www.spss.com/</ref> and many others<ref>Statistics.com list of commercial software http://www.statistics.com/resources/software/commercial/fulllist.php3</ref>. Most of the major commercial and free packages have many statistical procedures in common. The main reason to use free packages is probably cost.
===Menu driven packages===
Many of the packages have an opening menu that is used to get or enter data, manipulate the data and select the statistical analysis. One example of an opening menu, from MicrOsiris, is this:
{{Image|Microsirismenu.JPG|center|500px|MicrOsiris starting menu.}}
After starting the program, users generally load data, either from a previously saved data set or by importing from some other format. For example, MicrOsiris has an import menu like this:
{{Image|microsirsimport.JPG|center|500px|MicrOsiris import menu.}}
From this menu, data files in various formats can be imported. For example, if the data are in CSV form (text with commas between values), the program recognizes the format and creates a data set from the CSV file.
Finally, users can run an analysis. Again using MicrOsiris as an example, the menu for regression looks like this:
{{Image|microsirisregression.JPG|center|500px|MicrOsiris regression menu.}}
In this analysis menu, users select the variables of interest, along with other options. The analysis is then run and the results are displayed.
===Command driven packages===
A few programs, such as WinIDAMS and R, need commands for many of their procedures. WinIDAMS does have an interactive menu for reading in data, but specific statistical procedures then need a set of text commands. For example, the command lines for frequencies look like this:
:$COMMENT basic freqs of testing data<br/>
:$RUN TABLES<br/>
:$FILES<br/>
:DICTIN = PD_data_idams.dic<br/>
:DATAIN = PD_data_idams.dat<br/>
:$SETUP<br/>
:FREQUENCY TABLES<br/>
:PRINT=(CDICT)<br/>
:TABLES<br/>
:ROWVARS=(V21) CELLS=(ROWP,FREQS)<br/>
This set of commands identifies the procedure (TABLES), the data set and dictionary (PD_data_idams.dat and PD_data_idams.dic) and the variables. Each procedure has various options, which are outlined in the manuals.
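For comparison with a general-purpose language, the following is a minimal Python sketch of a one-variable frequency table with counts and percentages. It is not WinIDAMS output and does not read the IDAMS files; the values standing in for variable V21 are made up, and treating the FREQS and ROWP cell options as "count" and "percentage" is an assumption made only for this illustration.

```python
from collections import Counter

# Hypothetical values standing in for variable V21.
v21 = [1, 2, 2, 3, 1, 2]

def frequency_table(values):
    """Return (value, count, percent) tuples sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, counts[v], 100.0 * counts[v] / total) for v in sorted(counts)]

# Print a simple tab-separated table: value, frequency, percentage.
for value, freq, pct in frequency_table(v21):
    print(f"{value}\t{freq}\t{pct:.1f}%")
```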
===Getting data===
Most packages are able to import data from Excel or CSV (text with commas separating values).
One consideration is whether there are missing data. Some packages, like PSPP and MicrOsiris, can deal with missing data automatically. For example, suppose a data set looks like this:
{| class="wikitable"
|-
! Name
! Age
! Sex
! Born in US
! Degree
|-
| Joe
| 31
| M
| Yes
| BA
|-
| Sam
|
| M
| No
| MS
|-
| Sally
| 28
| F
|
| Ph.D.
|}
In this data set, Sam's age is missing, and so is whether Sally was born in the US. When some packages, like PSPP or MicrOsiris, read in or import the original data set, they recognize that those values are missing and do their calculations accordingly. MicrOsiris automatically assigns 1.5 or 1.6 billion to blanks to mark them as missing, and these values are excluded from analysis<ref name=micromanual/>.
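The behaviour described above (blank fields treated as missing and left out of calculations) can be sketched in Python as follows. This is an illustration of the idea only, not how PSPP or MicrOsiris implement it; the function names are hypothetical, and the CSV text simply restates the example table.

```python
import csv
import io

# The example table above, written as CSV with blanks for missing values.
CSV_TEXT = """Name,Age,Sex,Born in US,Degree
Joe,31,M,Yes,BA
Sam,,M,No,MS
Sally,28,F,,Ph.D.
"""

def read_with_blanks_as_missing(text):
    """Read CSV rows, turning empty fields into None (missing)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: (v if v.strip() != "" else None) for k, v in row.items()})
    return rows

def mean_age(rows):
    """Mean of Age over the rows where Age is not missing."""
    ages = [float(r["Age"]) for r in rows if r["Age"] is not None]
    return sum(ages) / len(ages)

rows = read_with_blanks_as_missing(CSV_TEXT)
print(mean_age(rows))  # Sam is excluded, so only Joe and Sally contribute
```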
Other packages need a placeholder, such as '-9', wherever data are missing<ref>Unesco, How to work with WinIDAMS. Section on Missing data values. http://www.unesco.org/webworld/idams/selfteaching/eng/emissing-data.htm</ref>. Before the package reads the data, the data set has to be edited to insert the placeholder in place of each missing value. For example:
{| class="wikitable"
|-
! Name
! Age
! Sex
! Born in US
! Degree
|-
| Joe
| 31
| M
| Yes
| BA
|-
| Sam
| -9
| M
| No
| MS
|-
| Sally
| 28
| F
| -9
| Ph.D.
|}
The data set now includes '-9', and whoever reads in the data must tell the program that -9 means missing data.
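Why the placeholder has to be declared can be seen in a small Python sketch. The ages come from the table above; the names MISSING_CODE and mean_ignoring are hypothetical and chosen only for this illustration.

```python
# Placeholder code chosen by the analyst to mark missing values.
MISSING_CODE = -9

# Ages from the table above: Joe, Sam (missing), Sally.
ages = [31, -9, 28]

def mean_ignoring(values, missing_code):
    """Mean of the values, skipping any equal to the declared missing code."""
    kept = [v for v in values if v != missing_code]
    return sum(kept) / len(kept)

# If the code is not declared, -9 is treated as a real age and distorts the mean.
naive_mean = sum(ages) / len(ages)

# Once -9 is declared as missing, only the two real ages are averaged.
declared_mean = mean_ignoring(ages, MISSING_CODE)

print(naive_mean, declared_mean)
```

The undeclared placeholder pulls the mean well below either real age, which is why the declaration step cannot be skipped.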
==Limitations of packages==
Most of the packages have limitations of some sort.
Variables in WinIDAMS are limited to 9 digits in length<ref name=winidamsguide2/>, so longer values have to be manipulated before analysis. In the version of PSPP current as of April 2009, only a limited number of procedures are available, including means, frequencies, crosstabs, two non-parametric tests, t-tests, ANOVA and basic regression<ref name=psppref/>. In addition, the output is apparently not easy to use, as it cannot be copied and pasted to other applications, and it is not clear where, in Windows Vista, the output is saved<ref>Re: using results of pspp, error report. Email thread about PSPP error handling, Pspp-users -- PSPP user discussion. April 2009, http://lists.gnu.org/archive/html/pspp-users/2009-04/msg00037.html</ref>. Several of the programs, including EasyReg, Epidata and Instat, do not appear to handle missing data or do not handle it well<ref name=shackman/>. While Epi Info has many statistical procedures, correlation is not one of them. Rather, correlation is obtained through regression<ref>CDC. Epi Info Training Session. Using Epi Info in an Outbreak Investigation. Advanced Analysis and Mapping. http://www.cdc.gov/cogh/dgphcd/training/softwaretraining.htm</ref>. This means that Epi Info will not produce a single table showing correlations among multiple variables. According to the Zelig installation manual, Zelig requires that R and several of its libraries already be installed, and the installation also requires some background in R<ref name=zelig/>. One limitation of MicrOsiris is in handling the output. When calculations are complete, the program pages through the results, but various menu boxes also appear over the results, so the results cannot be accessed on screen. The output can, however, be saved as a text file and then used.
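The remark above that correlation can be obtained through regression rests on a general identity for simple linear regression: the correlation equals the fitted slope multiplied by the ratio of the spread of x to the spread of y. The Python sketch below checks this on made-up data; it is not a description of Epi Info's internals, and the function names are hypothetical.

```python
import math

# Hypothetical paired observations.
xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 2.0, 5.0]

def centered_sums(xs, ys):
    """Return the centered sum of products and the two sums of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy, sxx, syy

def r_from_regression(xs, ys):
    """Correlation recovered from the least-squares slope of y on x."""
    sxy, sxx, syy = centered_sums(xs, ys)
    slope = sxy / sxx
    return slope * math.sqrt(sxx / syy)

def r_direct(xs, ys):
    """Pearson correlation computed directly, for comparison."""
    sxy, sxx, syy = centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

print(r_from_regression(xs, ys), r_direct(xs, ys))
```

Because each regression yields the correlation for only one pair of variables, a full correlation table would need one regression per pair, which is consistent with the limitation noted above.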
One limitation is specific to programs developed by individuals: support is limited to the time that the author has available. While authors may, and often do, respond fairly quickly when few people are asking questions, support is correspondingly slower if many people ask questions or the author is otherwise busy.
R is the work of a group of people, so more support is available. However, while R is powerful, it can also have a steep learning curve<ref>Gillian Raab, Susan Purdon, Kathy Buckner and Iona Waterston. The R Package. Napier University (Edinburgh) and the National Centre for Social Research (London). http://www2.napier.ac.uk/depts/fhls/peas/rpackage.asp</ref>.
==References==
<references/>[[Category:Suggestion Bot Tag]]
Latest revision as of 16:01, 18 August 2024
Free statistical software is a practical alternative to commercial packages. In general, free statistical software gives results that are the same as the results from commercial programs, and many of the packages are fairly easy to learn, using menu systems, although a few are command-driven. These packages come from a variety of sources, including governments, nongovernmental organizations (NGOs) like UNESCO, and universities, and are also developed by individuals.
Some packages are developed for specific purposes (e.g., time series analysis, factor analysis, calculators for probability distributions, etc.), while others are general packages, with a variety of statistical procedures. This article is a review of the general statistical packages.
Brief history of free statistical software
Some of the free software packages are from governmental or NGO organizations, such as Epi Info, from CDC[1], and IDAMS from UNESCO[2]. Some other software packages are from smaller or independent organizations or universities, such as Instat[3] or Irristat[4]. Another package, the R Project[5] is being developed by a group of volunteer individuals. A large proportion of free statistical software packages, however, are from individuals. Some of these software packages from individuals include Easyreg[6], MicrOsiris[7], OpenStat[8], PSPP[9], SOFA[10] and Zelig[11].
At least one package, WinIDAMS, was developed for the purposes of making key technologies available to those who could not otherwise afford them, to empower development[12]. OpenStat and Instat were developed as teaching aids[8],[3]. Other packages were developed for specific purposes but can be more generally used. Examples are Irristat[4], developed for agricultural analysis, and Epi Info[1], developed for public health. Several of the packages, PSPP, R and Osiris don't appear to give any statements about why they were developed, other than just general use for statistical analysis.
These free software packages have been used in a number of scholarly publications. For example, OpenStat was used in a research letter to JAMA[13] and in several published studies[14]. Irristat is used in this agricultural report[15], EasyReg is listed or used in these papers[16], EpiInfo was used in these papers[17], R was used in these papers[18] and WinIdams was used in these papers[19].
While Microsiris doesn't appear to be used in academic research, the author of the program was one of the original authors of OSIRIS[20], which was the starting program from which WinIdams was developed[21]. The author of Microsiris also has also contributed or co-contributed several components to WinIdams[21].
Reviews of free statistical software
There are a few reviews of free statistical software. There were two reviews in journals (but not peer reviewed), one by Zhu and Kuljaca[22] and another article by Grant that included mainly a brief review of the R Project[23]. Zhu and Kuljaca outlined some useful characteristics of software, such as ease of use, having a number of statistical procedures and ability to develop new procedures. They reviewed several programs and identified which ones, at that time, had the most functionality. At that time, several of the programs may not have had all of the desired ability for advanced statistics. Grant reviewed some of the programing features of R, and briefly mentioned the availability of other programs. One other paper reviewed statistical packages, mainly commercial, but includes R[24]. One article reviewed EasyReg and included a discussion of it's accuracy[25].
Only one review has compared the output of various packages[26]. In this review, all of the packages read either CSV (Comma Separated Values - text files in which all values are separated by commas) files or excel format. All of the packages gave exactly the same results for correlation and regression. The free software packages also gave the same regression results as did excel. One of the main differences among the packages was how they handled missing data. With the example data sets used in the review, and for the package versions available in November 2006 when this review was conducted, two packages, MicrOsiris and Epi Info, could read files with blanks for missing. Two other programs, Stat4U and WinIdams need something for the missing, like -9 or -9.99. The other packages could only handle data sets with no missing values.
Two websites that list software also have very brief reviews of each package. These two sites are StatCon[27] and by Pezzullo[28]. These sites mainly offer a brief list of the features available in the packages. Similarly, one other web site compares the statistical procedures available on free statistical packages[29]. In this review, R had all of the procedures, OpenStat had 16, MacAnova had 15, and Microsiris had 12. The others had from 8 to 11 of the procedures.
There is also a journal specifically for statistical software[30], although the main focus is on commercial software, R and some coding snippets.
In contrast, there are various reviews of commercial statistical software, such as a comparison between several major packages[31] and a brief review of several packages[32].
Using free statistical software
Before using any statistical packages, it is generally a good idea to have a solid background in Statistics. Then the packages can be used to the best advantage, for example, to choose the most appropriate test, to make sure all the necessary assumptions are met, so that the appropriate conclusions can be drawn.
Once the statistical issues are understood, the next step is to decide which package to use. Most of these packages are menu driven, and can be learned a couple of hours at most, except R, which is generally code driven and requires a much longer time to learn, and to some extent CDC's Epi Info, which also takes some time to learn.
Several of the packages also have tutorials. These tutorials help with a basic introduction and learning the basics of programs. For example, CDC has these tutorials about Epi Info[33],[34]. The CDC page also lists a video slide show tutorial from the University of Nebraska [35], and another site has on line training classes[36]. R has a large number of tutorials and manuals, in English and other languages[37], and a faq site[38]. A few of the packages have a email discussion lists including R[39] and PSPP[40].
Most of the packages have online manuals, guides or help pages. These are useful when there are questions about specific procedures or statistical tests. Manuals or guides are available for R[41], EasyReg[42], OpenStat[8], PSPP[43], ViSta[44], WinIDAMS[45],[46], MicrOsiris[47] and Zelig[11]. The CDC Epi Info site itself does not have a manual, but a faculty member at Emory's School of Public Health has written an introductory manual[48].
Finally, there are a number of commercial packages such as SAS[49], SPSS[50] and many others [51]. Most of the major commercial and free packages have many statistical procedures in common. The main reason to use free packages is probably the cost.
Menu driven packages
Many of the packages have some kind of opening menu that is used to get or enter the data, manipulate the data, and select the statistical analysis. One example of an opening menu, from MicrOsiris, is this:
After starting the program, people generally load data, either from a previously saved data set or by importing from some other format. For example, MicrOsiris has an import menu like this:
From this menu, data files in various formats can be imported. For example, if the data are in CSV form (text with commas between values), the program recognizes the format and creates a data set from the CSV file.
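As a rough illustration of what a CSV import step does, the following Python sketch reads comma-separated text into records keyed by the header row. It is not how MicrOsiris is implemented; the sample data and the function name `read_csv_records` are assumptions made for this example.

```python
import csv
import io

# Hypothetical CSV content. A real file would be opened with
# open(path, newline="") instead of io.StringIO.
CSV_TEXT = """Name,Age,Sex
Joe,31,M
Sally,28,F
"""

def read_csv_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]

records = read_csv_records(CSV_TEXT)
print(records[0])
```

Every value arrives as text, so a package (or a script like this one) still has to decide which columns are numeric before any analysis is run.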
Finally, people can use the program to do some analysis. Again, using MicrOsiris as an example, the menu for regression looks like this:
In this analysis menu, people can select the variables of interest, along with other options. Then the analysis is run and results are obtained.
Command driven packages
A few programs, like WinIDAMS and R, need commands for many of their procedures. WinIDAMS does have an interactive menu to read in data, but specific statistical procedures then need a set of text commands. For example, the command lines for frequencies look like this:
- $COMMENT basic freqs of testing data
- $RUN TABLES
- $FILES
- DICTIN = PD_data_idams.dic
- DATAIN = PD_data_idams.dat
- $SETUP
- FREQUENCY TABLES
- PRINT=(CDICT)
- TABLES
- ROWVARS=(V21) CELLS=(ROWP,FREQS)
This set of commands identifies the procedure (TABLES), the dictionary and data files (PD_data_idams.dic and PD_data_idams.dat) and the variables. The procedures all have various options, which are outlined in the manuals.
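For readers unfamiliar with frequency tables, the Python sketch below shows the kind of result such a command set asks for: a count and a percentage for each value of one variable. It only illustrates the calculation and does not reproduce WinIDAMS output. The variable values and the function name `frequency_table` are hypothetical.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, n, 100.0 * n / total) for v, n in sorted(counts.items())]

# Hypothetical codes for a single variable such as V21
v21 = [1, 2, 2, 3, 1, 2]
for value, count, percent in frequency_table(v21):
    print(f"{value:>5} {count:>5} {percent:6.1f}%")
```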
Getting data
Most packages are able to import data from Excel or CSV (text with commas separating values).
One consideration is whether there are missing data. Some packages, like PSPP and MicrOsiris, can deal with missing data automatically. For example, suppose a data set looks like this:
Name | Age | Sex | Born in US | Degree |
---|---|---|---|---|
Joe | 31 | M | Yes | BA |
Sam | | M | No | MS |
Sally | 28 | F | | Ph.D. |
In this data set, Sam's age is missing, and so is whether Sally was born in the US. When some packages, like PSPP or MicrOsiris, read in or import the original data set, they recognize that those values are missing and do their calculations accordingly. MicrOsiris automatically assigns 1.5 or 1.6 billion to blanks as missing, and these values are excluded from analysis[47].
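The idea of treating blanks as missing can be sketched in Python as follows. This is only an illustration of the general approach, not the internal logic of PSPP or MicrOsiris. The CSV text mirrors the example data set, and the helper `parse_age` is a name made up for this sketch.

```python
import csv
import io

# The example data set as CSV text, with blanks where values are missing
CSV_TEXT = """Name,Age,Sex,Born in US,Degree
Joe,31,M,Yes,BA
Sam,,M,No,MS
Sally,28,F,,Ph.D.
"""

def parse_age(cell):
    """Return the age as a float, or None when the cell is blank."""
    cell = cell.strip()
    return float(cell) if cell else None

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
ages = [parse_age(row["Age"]) for row in rows]

# Missing values are excluded before the mean is computed
valid = [a for a in ages if a is not None]
mean_age = sum(valid) / len(valid)
print(ages, mean_age)
```

Here the mean age is computed from Joe and Sally only, because Sam's blank age is recognized as missing rather than read as zero.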
Other packages need a placeholder, such as '-9', where data are missing[52]. Before the package reads the data, the data set has to be edited to put a placeholder wherever a value is missing. For example:
Name | Age | Sex | Born in US | Degree |
---|---|---|---|---|
Joe | 31 | M | Yes | BA |
Sam | -9 | M | No | MS |
Sally | 28 | F | -9 | Ph.D. |
The data set now includes '-9', and whoever reads in the data needs to tell the program that -9 means missing data.
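The following Python sketch illustrates why the placeholder has to be declared as missing. It assumes the placeholder -9 from the example, and the function names are hypothetical. Each package has its own way of declaring missing-value codes, described in its manual.

```python
def recode_missing(values, missing_code=-9):
    """Replace the placeholder code with None so it is not treated as data."""
    return [None if v == missing_code else v for v in values]

def mean_ignoring_missing(values):
    """Mean of the non-missing values, or None if every value is missing."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None

ages_raw = [31, -9, 28]          # Joe, Sam (missing), Sally
ages = recode_missing(ages_raw)

print(mean_ignoring_missing(ages))     # placeholder excluded
print(sum(ages_raw) / len(ages_raw))   # placeholder wrongly counted as an age
```

If the program is not told what -9 means, the second calculation is what happens: the placeholder is averaged in as if Sam were -9 years old.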
Limitations of packages
Most of the packages have limitations of some sort.
Variables in WinIDAMS are limited to 9 digits in length[46], so longer values have to be manipulated before analysis. In the version of PSPP current as of April 2009, only a limited number of procedures are available, including means, frequencies, crosstabs, two non-parametric tests, t-tests, ANOVA and basic regression[43]. In addition, the output is apparently not easy to use: it cannot be copied and pasted into other applications, and it is not clear where, in Windows Vista, the output is saved[53]. Several of the programs, including Easyreg, Epidata and Instat, do not appear to handle missing data or do not handle it well[26]. While Epi Info has many statistical procedures, correlation is not one of them; instead, correlation is obtained through regression[54]. This means that Epi Info will not produce a single table showing correlations among multiple variables. According to the Zelig installation manual, Zelig requires that R and several of its libraries already be installed, and the installation also requires some background in R[11]. One limitation of MicrOsiris is in handling the output. When calculations are complete, the program pages through the results, but various menu boxes also appear over the results, so the results cannot be accessed. The output can, however, be saved as a text file and then used.
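The link between regression and correlation can be sketched as follows. For a simple regression with one predictor, the Pearson correlation equals the square root of R-squared, taking the sign of the slope. The Python code below is a minimal illustration of that relationship, not Epi Info's procedure; the data and function names are hypothetical.

```python
import math

def simple_regression(x, y):
    """Least-squares fit y = a + b*x; returns (intercept, slope, r_squared).

    Assumes neither variable is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared

def correlation_from_regression(x, y):
    """Pearson r recovered as sign(slope) * sqrt(R^2)."""
    _, slope, r_squared = simple_regression(x, y)
    return math.copysign(math.sqrt(r_squared), slope)

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(correlation_from_regression(x, y))
```

Each regression gives the correlation for one pair of variables, so a table of correlations among several variables has to be assembled by running one regression per pair.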
One limitation is specific to programs that were developed by individuals. Support for these programs is limited to the time that the author has available. The authors may, and often do, respond fairly quickly when few people are asking questions, but if too many people ask questions or the author is otherwise busy, support will be correspondingly slower.
R is the work of a group of people, so it can have a lot of support. However, while R is powerful, it can also have a steep learning curve[55].
References
- ↑ 1.0 1.1 Epi Info, CDC, 2008 http://www.cdc.gov/epiinfo/index.htm.
- ↑ IDAMS Statistical Software, http://portal.unesco.org/ci/en/ev.php-URL_ID=2070&URL_DO=DO_TOPIC&URL_SECTION=201.html
- ↑ 3.0 3.1 Instat - an interactive statistical package, Statistical Services Centre - University of Reading, 2009. http://www.ssc.rdg.ac.uk/software/instat/instat.html
- ↑ 4.0 4.1 Irristat, International Rice Research Institute, Biometrics and Bioinformatics Unit, http://www.irri.org/science/software/irristat.asp
- ↑ The R Project, http://cran.r-project.org/
- ↑ Easy Reg International, Herman Bierens, Penn State University, 2008 http://econ.la.psu.edu/~hbierens/EASYREG.HTM
- ↑ MicrOsiris, Neal Van Eck, Van Eck Computer Consulting http://www.microsiris.com/
- ↑ 8.0 8.1 8.2 OpenStat, Bill Miller, 2009 http://www.statpages.org/miller/openstat/
- ↑ PSPP, 2008 http://www.gnu.org/software/pspp/
- ↑ Dr Grant Paton-Simpson, SOFA - Statistics Open For All, http://www.sofastatistics.com/home.php
- ↑ 11.0 11.1 11.2 Imai, Kosuke, Gary King and Olivia Lau. 2006. “Zelig: Everyone’s Statistical Software,” http://GKing.Harvard.Edu/zelig.
- ↑ UNESCO. 03-11-2004 . In Focus: Communication and Information Sector's In Focus service. UNESCO and Software. http://portal.unesco.org/ci/en/ev.php-URL_ID=17447&URL_DO=DO_TOPIC&URL_SECTION=201.html
- ↑ Future Salary and US Residency Fill Rate Revisited, Mark Ebell. Research letter in JAMA, September 10, 2008—Vol 300, No. 10, p1131-1132. http://jama.ama-assn.org/cgi/reprint/300/10/1131
- ↑ Differential gene expression patterns in cyclooxygenase-1 and cyclooxygenase-2 deficient mouse brain. Christopher D Toscano, Vinaykumar V Prabhu, Robert Langenbach, Kevin G Becker, and Francesca Bosetti. Genome Biol. 2007; 8(1): R14. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1839133
M Bielaszewska, B Sinha, T Kuczius and H Karch. Cytolethal Distending Toxin from Shiga Toxin-Producing Escherichia coli O157 Causes Irreversible G2/M Arrest, Inhibition of Proliferation, and Death of Human Endothelial Cells. Infection and Immunity, January 2005, p. 552-562, Vol. 73, No. 1. http://iai.asm.org/cgi/content/full/73/1/552
C.D. Toscano, P.J. Kingsley, L.J. Marnett, and F. Bosetti. NMDA-induced Seizure Intensity is Enhanced in COX-2 Deficient Mice. Neurotoxicology. 2008 November; 29(6): 1114–1120. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2587528
- ↑ FAO Plant Production and Protection Paper No. 174, Rome, 2003, Genotype x environment interactions. Challenges and opportunities for plant breeding and cultivar recommendations, http://www.fao.org/DOCREP/005/Y4391E/y4391e00.htm
- ↑ A Gambardella and Bronwyn H. Hall, "Proprietary versus public domain licensing of software and research products" (2006). Research Policy. 35 (6), pp. 875-892. Postprint available free at: http://repositories.cdlib.org/postprints/1865.
Liu, Wen-Chi and Tsangyao Chang, (2008) "Rational Bubbles in the Korea Stock Market? Further Evidence based on Nonlinear and Nonparametric Cointegration Tests." Economics Bulletin, Vol. 3, No. 34 pp. 1-12. http://economicsbulletin.vanderbilt.edu/2008/volume3/EB-08C30021A.pdf
Harumi Ito and Darin Lee, Journal of Economics and Business, Volume 57, Issue 1, January-February 2005, Pages 75-95. Assessing the impact of the September 11 terrorist attacks on U.S. airline demand. http://dx.doi.org/10.1016/j.jeconbus.2004.06.003. Also available here http://www.brown.edu/Departments/Economics/Papers/Papers/2003/2003-16_paper.pdf
- ↑ Rahav G, Gabbay R, Ornoy A, Shechtman S, Arnon J, Diav-Citrini O. Primary versus nonprimary cytomegalovirus infection during pregnancy, Israel. Emerg Infect Dis [serial on the Internet]. 2007 Nov [May 15, 2009]. Available from http://www.cdc.gov/EID/content/13/11/1791.htm
Chan P-C, Huang L-M, Wu Y-C, Yang H-L, Chang I-S, Lu C-Y, et al. Tuberculosis in children and adolescents, Taiwan, 1996–2003. Emerg Infect Dis [serial on the Internet]. 2007 Sep. Available from http://www.cdc.gov/EID/content/13/9/1361.htm
ME Gyasi, WMK Amoaku, and MA Adjuik. Epidemiology of Hospitalized Ocular Injuries in the Upper East Region of Ghana. Ghana Med J. 2007 December; 41(4): 171–175. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2350113
- ↑ Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris. statnet: Software Tools for the Representation, Visualization, Analysis and Simulation of Network Data. J Stat Softw. 2008; 24(1): 1548–7660. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2447931
Michael E. Hume, Charles M. Scanlan, Roger B. Harvey, Kathleen Andrews, James D. Snodgrass, Armen G. Nalian, Alexandra Martynova-Van Kley, and David J. Nisbet. Denaturing Gradient Gel Electrophoresis as a Tool To Determine Batch Similarity of Probiotic Cultures of Porcine Cecal Bacteria. Applied and Environmental Microbiology, August 2008, p. 5241-5243, Vol. 74, No. 16. http://aem.asm.org/cgi/content/abstract/74/16/5241
Max Bylesjö, Jeremy K Nicholson, Elaine Holmes and Johan Trygg. BMC Bioinformatics 2008, 9:106. http://www.biomedcentral.com/1471-2105/9/106
- ↑ N. S. Sapre, N. Pancholi, and S. Gupta, Computational Modeling of Substitution Effect on HIV–1 Non–Nucleoside Reverse Transcriptase Inhibitors with Kier–Hall Electrotopological State (E–state) Indices, Internet Electron. J. Mol. Des. 2008, 7, 55–67, http://www.biochempress.com/cv07_i03.html
Chawla, Anju. Exploring project selection behavior of academic scientists in India. Research Evaluation, Volume 16, Number 1, March 2007 , pp. 35-45(11). http://www.ingentaconnect.com/content/beech/rev/2007/00000016/00000001/art00004
- ↑ Data Sharing for Demographic Research Knowledge Base, question on OSIRIS, University of Michigan, http://dsdr-kb.psc.isr.umich.edu/answer.html?i=1076
- ↑ 21.0 21.1 IDAMS, Internationally Developed Data Analysis and Management Software Package. WinIDAMS Reference Manual (release 1.3) UNESCO, 2008. Preface. http://portal.unesco.org/ci/en/ev.php-URL_ID=25081&URL_DO=DO_TOPIC&URL_SECTION=-465.html
- ↑ "A Short Preview of Free Statistical Software Packages for Teaching Statistics to Industrial Technology Majors" Journal of Industrial Technology (Volume 21-2, April 2005), Ms. Xiaoping Zhu and Dr. Ognjen Kuljaca. http://www.nait.org/jit/current.html
- ↑ Felix Grant, "Free Statistics Software, Yours, Free to keep....", Scientific Computing World, Sept/Oct 2004, http://www.scientific-computing.com/scwsepoct04free_statistics.html
- ↑ Edward J. Wegman and Jeffrey L. Solka. 2005. Statistical Software for Today and Tomorrow. http://www.galaxy.gmu.edu/ (listed as "A Guide to Statistical Software").
- ↑ Hwan-sik Choi and Nicholas M. Kiefer, Software evaluation: EasyReg International. International Journal of Forecasting. Volume 21, Issue 3, July-September 2005, Pages 609-616. http://dx.doi.org/10.1016/j.ijforecast.2005.02.003
- ↑ 26.0 26.1 Shackman, Gene. 2006. "Comparing free statistical software for data sets with no missing values" and "Comparing free statistical software, Handling missing data". Both available here "Free Software" http://gsociology.icaap.org/methods/soft.html
- ↑ List of free statistical software, Open Source & Public Domain Packages with Source Code. StatCon 2006. http://statistiksoftware.com/free_software.html
- ↑ Pezzullo, Free Statistical Software, 2009. http://statpages.org/javasta2.html
- ↑ Andrea Corsini. 2009. Free Statistics. Free statistical software comparisons. http://en.freestatistics.info/comp.php
- ↑ Journal of Statistical Software, http://www.jstatsoft.org/
- ↑ Acock, Alan C. “SAS, Stata, SPSS: A Comparison”. Journal of Marriage and Family, November 2005, Vol. 67, pp. 1093-1095. Summarized in Hom, Willard. 2006. Choosing Between SAS, Stata, and SPSS. http://www.cccco.edu/SystemOffice/Divisions/TechResearchInfo/ResearchandPlanning/AbstractsofResearch/ResearchMethods/tabid/302/Default.aspx
- ↑ Wass, John. No date. Comparative Statistical Software Review. Tabulations and musings from your editor's biased perspective. Scientific Computing. http://www.scientificcomputing.com/comparative-statistical-software.aspx
- ↑ Epi Info™ Community Health Assessment Tutorial. The Epi Info™ Community Health Assessment Tutorial was produced by the collaborative efforts of the Centers for Disease Control and Prevention (CDC), the Assessment Initiative (AI), and the New York State Department of Health (NYSDOH). http://www.cdc.gov/epiinfo/communityhealth.htm
- ↑ Cholera Outbreak in Rwenshama: Using Epi Info for Windows in an Outbreak Investigation. Coordinating Office for Global Health - DGPHCD, http://www.cdc.gov/cogh/dgphcd/training/softwaretraining.htm
- ↑ Introduction to EPI2000. GPVEC Great Plains Veterinary Educational Center. University of Nebraska - Lincoln. http://gpvec.unl.edu/videos/epi-stats.asp
- ↑ The North Carolina Center for Public Health Preparedness Training Website http://nccphp.sph.unc.edu/training/index.html
- ↑ Contributed Documentation. http://cran.r-project.org/other-docs.html.
William Revelle, Using R for psychological research: A simple guide to an elegant package, 2008, http://personality-project.org/r/
Dong-Yun Kim, MAT 356 R Tutorial, Spring 2004. http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html
- ↑ R FAQ. Frequently Asked Questions on R. Version 2.8.2009-03-18. ISBN 3-900051-08-9 http://lib.stat.cmu.edu/R/CRAN/doc/FAQ/R-FAQ.html
- ↑ R-help -- Main R Mailing List: Primary help. https://stat.ethz.ch/mailman/listinfo/r-help
- ↑ Pspp-users -- PSPP user discussion, http://lists.gnu.org/mailman/listinfo/pspp-users
- ↑ R Development Core Team. An Introduction to R. Version 2.8.1 (2008-12-22). ISBN 3-900051-12-7. http://cran.r-project.org/doc/manuals/R-intro.html
- ↑ Herman J. Bierens. EasyReg International: Guided tours. No Date Given. http://econ.la.psu.edu/~hbierens/ERITOURS.HTM
- ↑ 43.0 43.1 Documentation, No Date Given. PSPP. http://www.gnu.org/software/pspp/documentation.html
- ↑ Forrest W. Young, 1996. ViSta User's Guide. http://forrest.psych.unc.edu/research/
- ↑ P.S. Nagpaul. 1999. Guide to Advanced Data Analysis using IDAMS Software. http://www.unesco.org/webworld/idams/advguide/TOC.htm
- ↑ 46.0 46.1 Unesco. 2008. WinIDAMS 1.3 Reference Manual - Table of Contents. http://www.unesco.org/webworld/portal/idams/html/english/TOC.htm
- ↑ 47.0 47.1 Van Eck, Richard, Microsiris, Statistical and Data Management Software System. Version 9.1, 2006. Van Eck Computer Consulting. http://www.microsiris.com/MicrOsiris.htm
- ↑ Kevin M. Sullivan. Mar 3 2008. Introduction to Epi Info (Version 3.4.1) Analyze Data Module. http://www.sph.emory.edu/~cdckms/
- ↑ http://www.sas.com/
- ↑ http://www.spss.com/
- ↑ Statistics.com list of commercial software http://www.statistics.com/resources/software/commercial/fulllist.php3
- ↑ Unesco, How to work with WinIDAMS. Section on Missing data values. http://www.unesco.org/webworld/idams/selfteaching/eng/emissing-data.htm
- ↑ Re: using results of pspp, error report. Email thread about PSPP error handling, Pspp-users -- PSPP user discussion. April 2009, http://lists.gnu.org/archive/html/pspp-users/2009-04/msg00037.html
- ↑ CDC. Epi Info Training Session. Using Epi Info in an Outbreak Investigation. Advanced Analysis and Mapping. http://www.cdc.gov/cogh/dgphcd/training/softwaretraining.htm
- ↑ Gillian Raab, Susan Purdon, Kathy Buckner and Iona Waterston. The R Package. Napier University (Edinburgh) and the National Centre for Social Research (London). http://www2.napier.ac.uk/depts/fhls/peas/rpackage.asp