Linguistic Dumpster Diving

September 21, 2008

I am grateful that the paper I wrote with my colleagues has been accepted at the Chicago Colloquium on Digital Humanities and Computer Science, to be held Nov 1-3. Here’s the abstract.

Defcon

August 12, 2008

I just returned from the Defcon conference in Las Vegas. The conference deals with hacking and computer security. (The organizers call it “real time social networking for ninjas” and Wired calls it the “world’s largest computer security convention.”) This year a federal judge prevented three MIT students from giving their talk on how to hack the smart cards used by the Boston subway system. Fortunately, their entire detailed presentation was included on the conference CD. I went to a number of talks dealing with penetration testing. Joe Cicero talked about hacking into the typical web applications used by universities. Nathan Hamiel and Shawn Moyer gave an excellent talk on attacking social networks. Most related to my work were a talk on breaking into SCADA systems and a talk on scanning for active ports on the internet.

Searching Arabic News Archive

June 24, 2008

I finally finished an initial draft of an online search tool for Arabic news sites. This work is part of the project Cactus: Computational Analysis of Cyber Terrorism against U.S., sponsored by the US Army Developmental Test Command, White Sands Missile Range. We (our team at New Mexico State University) are spidering a number of Arabic news sites daily. We index the content of these sites using the Indri indexing and search tool. We modified the tool slightly so it will perform stemming (Arabic Light 10 stemming) on UTF-8 text. Right now we have over 10,000 Arabic documents in our collection and it is growing daily. Duplicate documents are still a problem, and we are investigating various methods for detecting them in a reasonable amount of time. I currently detect only exact duplicates, using the SHA-1 hash function.
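
As a rough illustration of exact-duplicate detection with SHA-1, here is a minimal Python sketch. It is not the project's actual code: the function names, the sample file names, and the choice to hash the UTF-8 bytes of the extracted text are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def sha1_of_text(text: str) -> str:
    """Return the hex SHA-1 digest of a document's UTF-8 bytes."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_exact_duplicates(documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Group document ids by digest and keep groups with more than one member."""
    groups: Dict[str, List[str]] = {}
    for doc_id, text in documents.items():
        groups.setdefault(sha1_of_text(text), []).append(doc_id)
    return {digest: ids for digest, ids in groups.items() if len(ids) > 1}


# Hypothetical sample collection: a and b are byte-identical,
# c differs from them only by a trailing space.
docs = {
    "a.html": "خبر عاجل",
    "b.html": "خبر عاجل",
    "c.html": "خبر عاجل ",
}
duplicates = find_exact_duplicates(docs)
for digest, ids in duplicates.items():
    print(digest, ids)
```

Only a.html and b.html are grouped together. The third document differs by a single trailing space and so gets a completely different digest, which is why hashing alone catches exact duplicates but not near duplicates.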

Identifying the source of Arabic documents

June 11, 2008

For the last few months I’ve been working on methods to identify the source of Arabic documents. For example, given a document I would like to identify where it was written (Syria, Libya, Sudan, etc.). This task is part of a larger project, involving New Mexico Tech, New Mexico State University, and the University of Mary Washington, to identify cyberterrorist threats. I have over 4,000 Arabic documents from five different newspapers. Most of the documents are around 15-25 KB in size. My method uses the sequential minimal optimization algorithm to train a support vector machine. I have been evaluating the approach using 10-fold cross-validation and have been getting over 99% classification accuracy. I am currently working on writing several papers on this. As soon as I have a paper accepted I will post it here.
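
For readers unfamiliar with the evaluation protocol, here is a minimal Python sketch of k-fold cross-validation. It only illustrates how the folds are formed and how the accuracy is averaged. The names `k_fold_indices`, `cross_validate`, and `train_majority` are hypothetical, and the majority-class "classifier" is a stand-in: an SMO-trained support vector machine would be passed in as `train_fn` instead. The post does not describe the features used, so the toy data below is invented purely to make the example run.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def k_fold_indices(n_items: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the item indices and deal them into k disjoint folds."""
    if k < 2 or k > n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(
    features: Sequence,
    labels: Sequence,
    train_fn: Callable[[list, list], Callable],
    k: int = 10,
    seed: int = 0,
) -> float:
    """Train on k-1 folds, test on the held-out fold, and average the accuracies."""
    accuracies = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
        )
        correct = sum(model(features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def train_majority(train_features: list, train_labels: list) -> Callable:
    """Placeholder trainer: always predicts the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority


# Invented toy data: 20 "documents" with two made-up source labels.
toy_features = list(range(20))
toy_labels = ["A"] * 15 + ["B"] * 5
accuracy = cross_validate(toy_features, toy_labels, train_majority, k=10)
print(accuracy)
```

Each document is held out exactly once, so the reported figure reflects performance on text the model did not see during training.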