<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ron Zacharski &#187; News</title>
	<atom:link href="http://www.zacharski.org/category/news/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.zacharski.org</link>
	<description></description>
	<lastBuildDate>Mon, 22 Mar 2010 01:56:14 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Submitted paper to topiCS</title>
		<link>http://www.zacharski.org/2010/03/05/submitted-paper-to-topics/</link>
		<comments>http://www.zacharski.org/2010/03/05/submitted-paper-to-topics/#comments</comments>
		<pubDate>Fri, 05 Mar 2010 17:27:09 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://www.zacharski.org/?p=701</guid>
		<description><![CDATA[Jeanette Gundel (University of Minnesota), Nancy Hedberg (Simon Fraser University) and I just submitted a paper to the new journal topiCS (topics in Cognitive Science). This pretty much consumed my entire spring break. The title of the paper is Underspecification of Cognitive Status in Reference Production: Some Empirical Predictions. Here&#8217;s the abstract. Within the Givenness [...]]]></description>
			<content:encoded><![CDATA[<p>Jeanette Gundel (University of Minnesota), Nancy Hedberg (Simon Fraser University) and I just submitted a paper to the new journal topiCS (topics in Cognitive Science). This pretty much consumed my entire spring break. The title of the paper is Underspecification of Cognitive Status in Reference Production: Some Empirical Predictions. Here&#8217;s the abstract. <span id="more-701"></span></p>
<p>Within the Givenness Hierarchy framework of Gundel, Hedberg, &amp; Zacharski (1993), lexical items included in referring forms are assumed to conventionally encode two kinds of information: conceptual information about the speaker&#8217;s intended referent and procedural information about the assumed cognitive status of that referent in the mind of the addressee, which is encoded by various determiners and pronouns. In this paper, we focus on effects of underspecification of cognitive status, establishing that, while salience and degree of accessibility play an important role in reference production and understanding, the Givenness Hierarchy itself is not a hierarchy of degrees of salience/accessibility, contrary to what has often been assumed. The framework is thus able to account for a number of interesting experimental results in the literature without making additional assumptions about form-specific constraints.</p>
<div id="_mcePaste">(<a href="/files/papers/Gundel_Hedberg_Zacharski_TopiCS.pdf">PDF of the full paper</a>)</div>
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2010/03/05/submitted-paper-to-topics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hackintosh</title>
		<link>http://www.zacharski.org/2009/12/31/hackintosh/</link>
		<comments>http://www.zacharski.org/2009/12/31/hackintosh/#comments</comments>
		<pubDate>Thu, 31 Dec 2009 01:55:17 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Misc]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Hackintosh]]></category>

		<guid isPermaLink="false">http://roz.rockettools.com/?p=26</guid>
		<description><![CDATA[Over Christmas break I built a Hackintosh. I was partly inspired by my son, Adam, who a few months ago built an Intel i7 920-based Hackintosh using a solid-state drive, an ASUS P6T Deluxe V2 motherboard, 6GB of Corsair memory, and a Sapphire Radeon HD4870 1GB DDR5 Dual DVI / TVO PCI-Express graphics card. I was also inspired by a Hackintosh how-to article on Lifehacker.]]></description>
			<content:encoded><![CDATA[<p>Over Christmas break I built a Hackintosh. I was partly inspired by my son, Adam, who a few months ago built an Intel i7 920-based Hackintosh using a solid-state drive, an ASUS P6T Deluxe V2 motherboard, 6GB of Corsair memory, and a Sapphire Radeon HD4870 1GB DDR5 Dual DVI / TVO PCI-Express graphics card.<span id="more-26"></span> I was also inspired by <a href="http://lifehacker.com/5351485/how-to-build-a-hackintosh-with-snow-leopard-start-to-finish">a Hackintosh how-to article</a> on Lifehacker. I could have gone the safe route and used the exact components of Adam’s or the Lifehacker build. Instead, I decided to build a Hackintosh based on the Intel i7 860 Lynnfield. From reports on tech websites, the 860 seems like a slightly better processor than the 920 (for example, <a href="http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=3641">this AnandTech review</a>). Both have a 4-core/8-thread design, but the 860 has a slightly higher clock speed and a higher single-core turbo frequency. My build included the Gigabyte GA-P55-UD3R, 8GB of Patriot memory, and a GeForce 9800 video card (chosen mainly for its compatibility with Snow Leopard). The total cost of the build was around $700 (not including a case, power supply, and a CD/DVD drive, which I salvaged from my previous computer, an Ubuntu box I built).</p>
<p>Regarding installing Snow Leopard, I was unsuccessful in getting either the boot CD method or the USB method to work with this build (both methods are described on the <a href="http://tonymacx86.blogspot.com/2009/12/install-os-x-snow-leopard-directly-from.html">tonymacx86 website</a>). The method that did work was to install the OS on the hard drive by using another Mac. My Hackintosh runs the stock 10.6.2 kernel. I used a DSDT specific to my motherboard, available at <a href="http://tonymacx86.blogspot.com/2009/12/dsdt-database-for-p55-motherboards.html">DSDT Database for P55 Motherboards</a>. To get networking working I used a kext specific to the network chipset. Most of the information on how to do this is available on the tonymacx86 website. Sound does not work, even after trying various kexts specific to the audio chipset. I hope to resolve this by using a USB audio interface. Other than sound, everything seems to be working great!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2009/12/31/hackintosh/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Arabic Localization</title>
		<link>http://www.zacharski.org/2009/12/26/arabic-localization/</link>
		<comments>http://www.zacharski.org/2009/12/26/arabic-localization/#comments</comments>
		<pubDate>Sat, 26 Dec 2009 01:56:42 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://roz.rockettools.com/?p=32</guid>
		<description><![CDATA[Over the Christmas break I have been looking at words in Standard Arabic that are more common in one region compared to another. This is a continuation of work I have been doing with Ahmed Abdelali and Steve Helmreich. Ahmed has collected a corpus of Standard Arabic texts from newspapers in Egypt, Sudan, Libya, Syria, [...]]]></description>
			<content:encoded><![CDATA[<p>Over the Christmas break I have been looking at words in Standard Arabic that are more common in one region compared to another. This is a continuation of work I have been doing with Ahmed Abdelali and Steve Helmreich. Ahmed has collected a corpus of Standard Arabic texts from newspapers in Egypt, Sudan, Libya, Syria, and the UK. In previous work we looked at distinguishing texts from different regions using the frequency of common words (the equivalent of common English words such as <em>at</em>, <em>on</em>, and <em>in</em>). In this work over Christmas break, I was looking for differences in the frequency of content words (similar to Amazon’s ‘statistically improbable phrases’): words that occur in texts more frequently than you would expect by chance. <span id="more-32"></span> I used two statistics, log likelihood and mutual information. Work by Ted Dunning suggests that log likelihood works better for statistically rare events than mutual information does. Currently I am not sure what to make of the results, but here are the top six ‘statistically improbable’ words from each region (using log likelihood):</p>
<p>Sudan المقاولون الساحل برانكو اعداده الموردة استعدادا<br />
Egypt  المقاولون الساحل برانكو الموردة التضامن اعداده<br />
UK المقاولون الساحل برانكو اعداده التضامن الامل<br />
Libya  مدني الساحل استعدادا الامل الاولمبي حليم<br />
Syria بشار البوكمال المادة النادي الفنان الشاعر</p>
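<p>For readers who want to try the two statistics mentioned above, here is a minimal Python sketch. It is not the code used to produce the lists in this post. It assumes a 2x2 table of counts for one word and one region, computes Dunning's log-likelihood ratio (G2), and uses pointwise mutual information as a stand-in for "mutual information". The function names and the way the counts are arranged are my own assumptions.</p>

```python
import math


def log_likelihood(a, b, c, d):
    """Dunning's G2 for a 2x2 table of counts (layout is an assumption).

    a: count of the word in the target region
    b: count of the word in all other regions
    c: count of all other words in the target region
    d: count of all other words in all other regions
    """
    total = a + b + c + d
    observed = [a, b, c, d]
    # Expected counts if word and region were independent.
    expected = [
        (a + b) * (a + c) / total,
        (a + b) * (b + d) / total,
        (c + d) * (a + c) / total,
        (c + d) * (b + d) / total,
    ]
    # Cells with a zero count contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def pointwise_mutual_information(a, b, c, d):
    """log2 of P(word, region) / (P(word) * P(region)); requires a > 0."""
    total = a + b + c + d
    return math.log2(a * total / ((a + b) * (a + c)))
```

<p>Ranking every word in a region by its G2 score, highest first, would give a list of the kind shown above. Both scores are zero when a word is exactly as frequent in the region as in the rest of the corpus.</p>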
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2009/12/26/arabic-localization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Delivered paper at the Computational Approaches to Arabic Script-based Languages workshop</title>
		<link>http://www.zacharski.org/2009/08/28/delivered-paper-at-the-computational-approaches-to-arabic-script-based-languages-workshop/</link>
		<comments>http://www.zacharski.org/2009/08/28/delivered-paper-at-the-computational-approaches-to-arabic-script-based-languages-workshop/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 01:59:16 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://roz.rockettools.com/?p=48</guid>
		<description><![CDATA[I presented the paper Investigations on standard Arabic geographical classification at the Computational Approaches to Arabic Script-based Languages workshop. Immediately before my talk, I convinced myself that the paper was not related to the conference topic and that it was simplistic. However, it seems that it was well received. One of the conference organizers, Ali Farghaly, [...]]]></description>
			<content:encoded><![CDATA[<p>I presented the paper Investigations on standard Arabic geographical classification at the <a href="http://arabicscript.org/CAASL3/index.html">Computational Approaches to Arabic Script-based Languages</a> workshop. Immediately before my talk, I convinced myself that the paper was not related to the conference topic and that it was simplistic. However, it seems that it was well received. <span id="more-48"></span>One of the conference organizers, Ali Farghaly, said it was important work, which is nice to hear. I received perhaps a half dozen positive comments from people and I am extremely grateful for their kind words. Several people offered great suggestions for future work. Most of them related to identifying content words that may help in the geographical classification. Several people suggested using Term Frequency-Inverse Document Frequency (TF-IDF). Prior to the workshop I was thinking of using log likelihood or mutual information to do a similar identification, and several people suggested similar approaches. I am thankful that people took the time to offer these suggestions.</p>
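<p>As a rough illustration of the TF-IDF suggestion, here is a minimal Python sketch. It assumes each document is already a list of tokens and uses one common variant of the weighting (relative term frequency times the natural log of the inverse document frequency). Other variants exist, and the function name is my own.</p>

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

<p>With this variant, a term that occurs in every document gets a weight of zero, so the terms that stand out are the ones concentrated in a few documents.</p>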
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2009/08/28/delivered-paper-at-the-computational-approaches-to-arabic-script-based-languages-workshop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Paper accepted to Arabic Script Languages Workshop</title>
		<link>http://www.zacharski.org/2009/07/23/paper-accepted-to-arabic-script-languages-workshop/</link>
		<comments>http://www.zacharski.org/2009/07/23/paper-accepted-to-arabic-script-languages-workshop/#comments</comments>
		<pubDate>Thu, 23 Jul 2009 01:59:35 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://roz.rockettools.com/?p=50</guid>
		<description><![CDATA[I am grateful that the paper Ahmed Abdelali, Steve Helmreich, and I worked on was accepted at the Computational Approaches to Arabic Script-based Languages workshop to be held August 26th in Ottawa (workshop program). I would also like to thank the three reviewers for their helpful comments. Here is the conclusion. Our work focused [...]]]></description>
			<content:encoded><![CDATA[<p>I am grateful that the paper Ahmed Abdelali, Steve Helmreich, and I worked on was accepted at the Computational Approaches to Arabic Script-based Languages workshop to be held August 26th in Ottawa (<a href="http://www.arabicscript.org/CAASL3/program.html">workshop program</a>). I would also like to thank the three reviewers for their helpful comments. Here is the conclusion.<span id="more-50"></span></p>
<p>Our work focused on answering two questions: (1) can we geographically classify documents solely on the frequency of common words, and (2) rather than dialects, can we classify regional variations in one dialect (for example, can we classify regional differences in Modern Standard Arabic). We developed a series of studies aimed at answering these questions. These studies showed that it is possible to accurately classify newspaper documents solely using the common words in the documents. One study compared the performance of 10 classifiers on this task and provided some evidence that Bagging C4.5, C4.5, and SMO with a polynomial kernel produce the most accurate classifiers. One major limitation of these studies is that they relied on a single data source for each country. Because a single newspaper source was used for each region, it could be argued that the classifiers were classifying the documents based on the newspaper rather than on geographical region. To examine this possibility, we evaluated the performance of the classifier on a different genre: forum posts. The results here are less than compelling; nevertheless the classifier had moderate accuracy on classifying forum posts. We will examine this in more detail in future work using a larger corpus from a wider breadth of sources. Finally, we examined the effect of document size on classification accuracy, finding that we could get good classification accuracy even for relatively short documents. These studies suggest that the answer to both questions raised at the beginning of this paragraph is yes: yes, we can geographically classify documents based on common word frequency, and yes, we can classify regional differences in Modern Standard Arabic.<br />
This work has direct practical application to intelligence tasks. It may help in determining the author of an anonymous document. For example, a geographical classifier can be used as one module of a system designed to detect cyber terrorist threats against the U.S. by aiding in the identification of the source of the threat. Finally, many Arabic scholars (Shukri B. Abed, p.c.) believe there are no regional variations of Modern Standard Arabic. The work reported on here provides some support for the alternative view that there are regional variations (see, for example, Ibrahim and Ibrahim, 2009 and Abdelali, 2004). Future work using larger corpora from a broad number of sources may provide stronger evidence for this position. (<a href="http://www.zacharski.org/papers/caasl2009.pdf">PDF of the draft paper</a>)</p>
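<p>To make "classifying on the frequency of common words" concrete, here is a minimal Python sketch of the feature extraction step only. It is not the code from the paper. The word list is a hypothetical English stand-in for the Arabic common words, and the function name is my own. The resulting vectors are the kind of input one could hand to a classifier such as C4.5 or SMO.</p>

```python
from collections import Counter

# Hypothetical stand-in list; the paper used common Arabic words.
COMMON_WORDS = ["at", "on", "in", "the", "of"]


def common_word_features(tokens, common_words=COMMON_WORDS):
    """Relative frequency of each common word in one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty document
    return [counts[word] / total for word in common_words]
```

<p>Using relative rather than raw frequencies keeps documents of different lengths comparable, which matters when short documents such as forum posts are classified alongside newspaper articles.</p>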
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2009/07/23/paper-accepted-to-arabic-script-languages-workshop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Paper submitted to the Arabic Script-based Languages Workshop</title>
		<link>http://www.zacharski.org/2009/05/21/paper-submitted-to-the-arabic-script-based-languages-workshop/</link>
		<comments>http://www.zacharski.org/2009/05/21/paper-submitted-to-the-arabic-script-based-languages-workshop/#comments</comments>
		<pubDate>Thu, 21 May 2009 01:59:51 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://roz.rockettools.com/2010/02/14/paper-submitted-to-the-arabic-script-based-languages-workshop/</guid>
		<description><![CDATA[Ahmed Abdelali, Steve Helmreich and I just submitted a paper to CAASL3: Computational Approaches to Arabic Script-based Languages to be held in Ottawa on August 26th. It reports on work we have done on geographical classification of Arabic text. We presented a paper on this topic at the Chicago Colloquia on Digital Humanities and Computer [...]]]></description>
			<content:encoded><![CDATA[<p>Ahmed Abdelali, Steve Helmreich and I just submitted a paper to CAASL3: Computational Approaches to Arabic Script-based Languages to be held in Ottawa on August 26th. It reports on work we have done on geographical classification of Arabic text. We presented a paper on this topic at the Chicago Colloquia on Digital Humanities and Computer Science back in November 2008 (<a href="http://www.zacharski.org/papers/dumpsterPaper_v2.pdf">Linguistic Dumpster Diving: Geographical Classification of Arabic Text – pdf</a>). At that colloquium, a number of people gave us good suggestions and criticisms. Our work since then has included investigating the suggestions these people made and also addressing the criticisms. For example, one individual suggested we look at non-linear methods of classification.<span id="more-52"></span> One thing we did was to compare learning algorithms on this task. In our original work we used a support vector machine approach. We compared that approach to C4.5 decision trees, Bagging C4.5, Hyperpipes, nearest neighbor, K-nearest neighbors, Naive Bayes, Neural Network classifiers, SMO with a polynomial kernel and SMO with an RBF kernel. Of these, SMO with a polynomial kernel, neural nets, and Bagging C4.5 appear to perform the best. In addition, we investigated the performance improvement from adding data from new sources. We are continuing work in this area. If you have any questions or suggestions please let us know.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2009/05/21/paper-submitted-to-the-arabic-script-based-languages-workshop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Paper submitted to the Machine Translation Summit</title>
		<link>http://www.zacharski.org/2009/04/30/paper-submitted-to-the-machine-translation-summit/</link>
		<comments>http://www.zacharski.org/2009/04/30/paper-submitted-to-the-machine-translation-summit/#comments</comments>
		<pubDate>Thu, 30 Apr 2009 02:00:08 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://roz.rockettools.com/?p=53</guid>
		<description><![CDATA[As I mentioned in previous posts, I developed (with tremendous help from Adam Zacharski) a cross-language instant messaging system using Adobe Flex. This system provides concurrent real-time translation for instant messaging using multiple machine translation engines. During this last academic year, Bill Ogden, my colleague in New Mexico, and several people in his lab (Sieun [...]]]></description>
			<content:encoded><![CDATA[<p>As I mentioned in previous posts, I developed (with tremendous help from Adam Zacharski) a cross-language instant messaging system using Adobe Flex. This system provides concurrent real-time translation for instant messaging using multiple machine translation engines. During this last academic year, Bill Ogden, my colleague in New Mexico, and several people in his lab (Sieun An and Yuki Ishikawa) used this system to evaluate the performance of machine translation systems based on how effective they were in helping people accomplish shared tasks. They used paid participants who worked in pairs (one Japanese speaker paired with a native English speaker) to accomplish a photo identification task using this instant messaging system. We just submitted a paper describing the results of this work to the Machine Translation Summit in Ottawa in August.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2009/04/30/paper-submitted-to-the-machine-translation-summit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Amazing Remix</title>
		<link>http://www.zacharski.org/2009/04/04/amazing-remix/</link>
		<comments>http://www.zacharski.org/2009/04/04/amazing-remix/#comments</comments>
		<pubDate>Sat, 04 Apr 2009 02:00:50 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Music]]></category>

		<guid isPermaLink="false">http://roz.rockettools.com/?p=55</guid>
		<description><![CDATA[Okay. This is my first YouTube post. What this guy did was take videos of individual performers on YouTube (many of them instructional videos) and remix them into a band. Truly amazing!]]></description>
			<content:encoded><![CDATA[<p>Okay. This is my first YouTube post. What this guy did was take videos of individual performers on YouTube (many of them instructional videos) and remix them into a band. Truly amazing!<span id="more-55"></span></p>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="349" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://www.youtube.com/v/tprMEs-zfQA&amp;border=1&amp;color1=0x5d1719&amp;color2=0xcd311b&amp;hl=en_US&amp;feature=player_embedded&amp;fs=1" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="425" height="349" src="http://www.youtube.com/v/tprMEs-zfQA&amp;border=1&amp;color1=0x5d1719&amp;color2=0xcd311b&amp;hl=en_US&amp;feature=player_embedded&amp;fs=1" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2009/04/04/amazing-remix/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Playing with the Stanford Log-linear Part-Of-Speech Tagger</title>
		<link>http://www.zacharski.org/2009/03/04/playing-with-the-stanford-log-linear-part-of-speech-tagger/</link>
		<comments>http://www.zacharski.org/2009/03/04/playing-with-the-stanford-log-linear-part-of-speech-tagger/#comments</comments>
		<pubDate>Wed, 04 Mar 2009 02:01:25 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://roz.rockettools.com/?p=57</guid>
		<description><![CDATA[I would like to create a part-of-speech tagger for Paraguayan Guarani. Initially I thought I would use the Brill part of speech tagger, but it seems to have vanished from the web. In my search, I ran across the Stanford Log-Linear Part-Of-Speech Tagger. It was developed by Chris Manning’s group and I figured anything developed by Chris [...]]]></description>
			<content:encoded><![CDATA[<p>I would like to create a part-of-speech tagger for Paraguayan Guarani. Initially I thought I would use the Brill part-of-speech tagger, but it seems to have vanished from the web. In my search, I ran across the <a href="http://nlp.stanford.edu/software/tagger.shtml">Stanford Log-Linear Part-Of-Speech Tagger</a>. It was developed by <a href="http://nlp.stanford.edu/~manning/">Chris Manning</a>’s group and I figured anything developed by Chris Manning is probably exceptional. I downloaded it and ran the included English part-of-speech tagger on a 250k text (a public domain Tom Swift book). <span id="more-57"></span>It took about half an hour on a newish Core Duo machine. Training a part-of-speech tagger is a bit more complex, simply because of the lack of documentation. First you need a tagged corpus. There is some variability allowed in how this text is formatted. I simply used a text file where each word-tag pair is represented as word_tag. For example,</p>
<blockquote><p>The_DT old_JJ Foger_NNP homestead_NN is_VBZ closed_VBN up_RP ,_, though_IN I_PRP did_VBD see_VB a_DT man_NN working_VBG around_IN it_PRP to-day_JJ as_IN I_PRP came_VBD past_NN ._.</p></blockquote>
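<p>If you want to sanity-check a corpus in the word_tag format above before training, a few lines of Python will do it. This is a minimal sketch and not part of the Stanford download. The function name is my own, and it assumes tokens are separated by whitespace with an underscore between word and tag.</p>

```python
def parse_tagged_line(line, separator="_"):
    """Split a line of word_tag tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last separator so words containing "_" stay intact.
        # A token with no separator comes back with an empty word.
        word, _, tag = token.rpartition(separator)
        pairs.append((word, tag))
    return pairs
```

<p>Running this over every line and collecting the set of tags is a quick way to spot typos in the tag set before handing the file to the trainer.</p>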
<p>In addition to this text file you need a props file (basically a configuration file). There is a sample configuration file in the models folder of the Stanford download. You need to edit that to match your local settings. Finally, you will need to increase the amount of memory allocated to Java.</p>
<p>The command I used that actually generated a part of speech tagger is</p>
<blockquote><p>java -mx500m   -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model guaranimodel -trainFile guarani.txt -prop models\mymodel.props</p></blockquote>
<p>Once I play around with this more I will post how well this works.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2009/03/04/playing-with-the-stanford-log-linear-part-of-speech-tagger/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Textbooks for data mining</title>
		<link>http://www.zacharski.org/2008/11/16/textbooks-for-data-mining/</link>
		<comments>http://www.zacharski.org/2008/11/16/textbooks-for-data-mining/#comments</comments>
		<pubDate>Sun, 16 Nov 2008 02:02:40 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Teaching]]></category>

		<guid isPermaLink="false">http://roz.rockettools.com/?p=61</guid>
		<description><![CDATA[I finally made a decision regarding what textbook to use for a data mining course I will be teaching in the spring. One challenge was that the course is cross-listed in a variety of departments: computer science, business, and information technology and, as a result, the students taking the class will have a diversity of [...]]]></description>
			<content:encoded><![CDATA[<p>I finally made a decision regarding what textbook to use for a data mining course I will be teaching in the spring. One challenge was that the course is cross-listed in a variety of departments: computer science, business, and information technology and, as a result, the students taking the class will have a diversity of backgrounds: some strong in statistics, others in programming. My original plan was not to have people do programming at all and have them just use Weka, a free data mining tool. I was considering two textbooks: <a href="http://www.amazon.com/Introduction-Data-Mining-Pang-Ning-Tan/dp/0321321367/ref=pd_bxgy_b_img_c"><em>Introduction to Data Mining</em></a> by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar; and <a href="http://www.amazon.com/Data-Mining-Practical-Techniques-Management/dp/0120884070/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1226845081&amp;sr=1-1"><em>Data Mining: Practical Machine Learning Tools and Techniques</em></a>, by Ian Witten and Eibe Frank.<span id="more-61"></span> I’ve owned the Witten &amp; Frank book for quite some time and found it useful, but the Tan et al. book seemed more comprehensive and presented a bit more of the mathematical foundations (but it wasn’t overwhelming math). Then a third book entered the picture: <em><a href="http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325/ref=pd_bbs_sr_1?ie=UTF8&amp;s=books&amp;qid=1226844127&amp;sr=8-1">Programming Collective Intelligence</a></em> by Toby Segaran. Even though I have only had it for a week, I like this book a lot. The book is oriented toward <strong>applying</strong> data mining tools. It uses Python and involves connecting to a variety of online data (for example, del.icio.us links). Among other things, the book covers how to build a recommendation system.</p>
<p><a href="http://roz.rockettools.com/wp-content/uploads/2010/02/collective.jpg"><img class="alignleft size-full wp-image-63" title="collective" src="http://roz.rockettools.com/wp-content/uploads/2010/02/collective.jpg" alt="" width="224" height="224" /></a>So now I have placed my textbook order and I am requiring two textbooks: the collective intelligence book and the Witten &amp; Frank one.</p>
<p>If you are a student taking this course in the spring and if you don’t already know Python, I would recommend learning Python basics over Christmas break.</p>
<p>One thing I will cover in the course that is not in either of these two books is visualization. We will probably be using the programming language Processing. Another possibility is to use the excellent site, many-eyes.com, which allows people to visualize their own data.</p>
<p>This should be fun!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.zacharski.org/2008/11/16/textbooks-for-data-mining/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
