Textbooks for data mining

November 16, 2008

I finally made a decision regarding what textbook to use for a data mining course I will be teaching in the spring. One challenge was that the course is cross-listed in a variety of departments (computer science, business, and information technology) and, as a result, the students taking the class will have diverse backgrounds: some strong in statistics, others in programming. My original plan was not to have people do any programming at all and to have them just use Weka, a free data mining tool. I was considering two textbooks: Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar; and Data Mining: Practical Machine Learning Tools and Techniques by Ian Witten and Eibe Frank. >>

Delivered presentation at the Chicago Digital Humanities Conference

November 3, 2008

About an hour ago I presented the talk titled Linguistic Dumpster Diving: Geographical Classification of Arabic Text. I co-authored this paper with my colleagues at New Mexico State University, Ahmed, Jim, and Steve. I think the talk was well received, and I got a number of great comments and suggestions. Unfortunately, I don’t know the names of all the people who made suggestions, so I can’t credit them all by name. In the talk, I primarily focused on a support vector machine approach to geographically classifying text. >>

Linguistic Dumpster Diving

September 21, 2008

I am grateful to have the paper I wrote with my colleagues accepted at the Chicago Colloquium on Digital Humanities and Computer Science to be held Nov 1-3. Here’s the abstract. >>

Pinetop Perkins – Antone’s – Austin

August 16, 2008

I was very fortunate to hear Pinetop Perkins at Antone’s. I had been planning to go to his concert at the Kennedy Center for the Performing Arts in September, but Antone’s is a small, intimate blues club. (Photos by Jack O’Diamonds.) For those who don’t know Pinetop Perkins, he is a 95-year-old blues piano player. He was originally a blues guitarist, but early in his career he injured his arm in a fight with a choir girl and switched to piano. His official website lists him as “one of the last great Mississippi bluesmen”, but others consider him part of the Chicago school of blues. >>

Defcon

August 12, 2008

I just returned from the Defcon conference in Las Vegas. The conference deals with hacking and computer security. (The organizers call it “real time social networking for ninjas” and Wired calls it the “world’s largest computer security convention.”) This year a federal judge prevented three MIT students from giving their talk on how to hack the smart cards used by the Boston subway system. Fortunately, their entire detailed presentation was included on the conference CD. I went to a number of talks dealing with penetration testing. Joe Cicero talked about hacking into the typical web applications used by universities. Nathan Hamiel and Shawn Moyer gave an excellent talk on attacking social networks. Most related to my work were a talk on breaking into SCADA systems and a talk on scanning for active ports on the internet. >>

Ka and Stomp out loud – Las Vegas

August 11, 2008

I was in Vegas for a conference and went to both Cirque du Soleil’s Ka and Stomp Out Loud. My tour book said something like the following: in a city that is all about technology, Ka is the most technological of all the Vegas shows. The show has a multipart stage. Each part can be raised and lowered, moved forward and backward, and tilted. At one moment the performers are on a flat stage; then the stage starts tilting and it is as if the performers are climbing a hill. Finally the stage is vertical and the performers are twirling acrobatically on the face of a cliff. It’s mind-blowing and Vegas all the way. The last time I went to Las Vegas I saw Cirque du Soleil’s Mystère. Ka is darker and less overtly athletic. The other show I went to this time was Stomp Out Loud at Planet Hollywood. This one was decidedly low tech, making use of brooms, newspapers, and barrels, among other things, to make rhythms. Pretty cool.

Writing a Flex Application – part 2

August 5, 2008

I finally have a reasonable demo of this application. Now I need to do a thorough job of testing and debugging. As I mentioned in the previous post, this is a chat client where two people can be messaging in different languages and the chat system translates from one to the other. One feature of the chat client is shown in the screenshot above. If you mouse over any translation, you can see alternative translations. So if one translation engine did a poor job in translating a chat message, you can look at translations from other engines. We hope that this feature, as well as several others, will improve the usefulness of this chat system.
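The mouse-over feature suggests a simple data model: each chat message carries one translation per engine, with the first shown in the chat window and the rest offered as alternatives. Below is a minimal sketch of that idea in Python. It is only an illustration and not the actual PHP/Flex code; the class name `TranslatedMessage` and the engine names `engine_a` and `engine_b` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TranslatedMessage:
    """One chat message plus its translations (hypothetical model)."""
    source_text: str
    source_lang: str
    target_lang: str
    # Maps engine name -> translated text; insertion order is the
    # order of preference, so the first entry is the one displayed.
    translations: Dict[str, str] = field(default_factory=dict)

    def primary(self) -> str:
        """Text shown in the chat window; falls back to the original."""
        for text in self.translations.values():
            return text
        return self.source_text

    def alternatives(self) -> List[Tuple[str, str]]:
        """(engine, text) pairs shown on mouse-over: all but the first."""
        return list(self.translations.items())[1:]


# Usage: two hypothetical engines translating an English message to Spanish.
msg = TranslatedMessage(
    source_text="hello",
    source_lang="en",
    target_lang="es",
    translations={"engine_a": "hola", "engine_b": "buenas"},
)
print(msg.primary())
print(msg.alternatives())
```

Keeping every engine's output with the message, rather than only the displayed one, is what makes the mouse-over lookup cheap: no engine has to be called again when the user asks for alternatives.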

The backend of the system is in PHP connected to a MySQL database server. I am using Flex for the client side component.

As I mentioned in my previous post, this system will be used to evaluate the effectiveness of multilingual instant messaging systems and is a continuation of Bill Ogden’s work on computer-mediated multilingual translation.

Writing a Flex application

July 19, 2008

This summer I have been doing some contract work with Bill Ogden of the Psychology Department at New Mexico State University. For years he has been investigating the usability of IM systems that have a machine translation component. In order to perform more nuanced experiments he needed a custom-designed IM system, and he contracted with me to do the work. When you log into the chat system you select a language, and all messages from other participants are translated into that language. For example, this would enable a Korean speaker and an English speaker to communicate with one another. The server-side system, which does all the translation and maintains the chat, is in PHP. The client-side app is in Adobe Flex. I was going to use AJAX, but my son recommended that I look at Flex. I think Flex is fantastic, but learning it has been a challenge. Most of my problems have been with understanding GUI components. The screenshot above is of the system right now, about midway through the project. Over the last month I have worked through the following Flex books: >>

W.C. Clark – Saxon Pub

July 2, 2008

I went to hear guitarist and vocalist W.C. Clark at the Saxon Pub in Austin. He’s been called the godfather of Austin blues. He has played with a number of the other big names in Austin blues, including Stevie Ray Vaughan. He was born in Austin in 1939. That makes him close to 70, and I was astonished by the energy he had. He played his 1 1/2 hour sets without a break, one song seamlessly morphing into the next. At times he would talk to the audience while comping blues patterns. His tunes ranged from standard blues to covers of Stevie Wonder tunes. The Saxon Pub was small and his sound system was not overpowering, meaning I wasn’t worrying about hearing loss. The sound was balanced well. His quartet consisted of himself on guitar and vocals, a drummer, a bass player, and a sax/harmonica player. Sadly, I don’t know their names. If you are interested in hearing him, he has clips on his website and there are some YouTube videos. He plays mostly in the Austin and Hill Country area.

Searching Arabic News Archive

June 24, 2008

I finally finished an initial draft of an online search tool for Arabic news sites. This work is part of the project Cactus: Computational Analysis of Cyber Terrorism against U.S., sponsored by the US Army Development Test Command, White Sands Missile Range. We (our team at New Mexico State University) are spidering a number of Arabic news sites daily. We index the content of these sites using the Indri indexing and search tool. We modified the tool slightly so it will perform stemming (Arabic Light 10 stemming) on UTF-8 text. Right now we have over 10,000 Arabic documents in our collection, and it is growing daily. We still have a problem with detecting duplicate documents and are investigating various methods for doing so in a reasonable amount of time. I currently detect exact duplicates using the SHA-1 hash function.
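Exact-duplicate detection with SHA-1 can be sketched in a few lines: hash each document's UTF-8 bytes and group documents that share a digest. This is a minimal illustration in Python under assumed names (`sha1_of_text`, `find_exact_duplicates`, and the sample document ids are hypothetical), not the code used in the project. Note that it only catches byte-for-byte identical text; near-duplicates that differ by a timestamp or a stray space hash differently.

```python
import hashlib
from typing import Dict, List


def sha1_of_text(text: str) -> str:
    """Return the SHA-1 hex digest of the document's UTF-8 bytes."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_exact_duplicates(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group document ids whose text is byte-for-byte identical.

    Returns a mapping from digest to the ids sharing it, keeping only
    digests that occur more than once.
    """
    by_digest: Dict[str, List[str]] = {}
    for doc_id, text in docs.items():
        by_digest.setdefault(sha1_of_text(text), []).append(doc_id)
    return {d: ids for d, ids in by_digest.items() if len(ids) > 1}


# Usage: hypothetical ids; the first two documents have identical text.
docs = {
    "site1/001": "خبر عاجل",
    "site2/417": "خبر عاجل",
    "site1/002": "خبر آخر",
}
for digest, ids in find_exact_duplicates(docs).items():
    print(digest, ids)
```

Because each document is hashed once and looked up in a dictionary, the pass is linear in the size of the collection, which matters as the archive grows daily.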