Identifying the source of Arabic documents

For the last few months I've been working on methods to identify the source of Arabic documents. For example, given a document, I would like to identify where it was written (Syria, Libya, Sudan, etc.). This task is part of a larger project to identify cyberterrorist threats, a collaboration among New Mexico Tech, New Mexico State University, and the University of Mary Washington. I have over 4,000 Arabic documents from 5 different newspapers, most of them around 15-25 KB in size. My method uses the sequential minimal optimization (SMO) algorithm to train a support vector machine. I have been evaluating the approach using 10-fold cross-validation and have been getting over 99% classification accuracy. I am currently writing several papers on this work. As soon as one is accepted, I will post it here.
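To make the training and evaluation steps concrete, here is a minimal sketch in Python of a linear support vector machine trained with a simplified variant of SMO and scored with 10-fold cross-validation. It is not the code used in this project. The post does not say how documents are turned into feature vectors, so the sketch uses made-up two-dimensional points, and it separates only two classes rather than five newspapers. The simplified SMO picks the second multiplier at random instead of using the full selection heuristics, and the names `train_smo`, `cross_validate`, and `make_toy_data` are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_smo(X, y, C=1.0, tol=1e-3, max_passes=10, max_sweeps=200, seed=0):
    """Simplified SMO for a linear-kernel SVM. Labels must be +1 or -1."""
    rng = random.Random(seed)
    n = len(X)
    alphas = [0.0] * n
    b = 0.0
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]  # kernel cache

    def f(i):
        # Current decision value for training point i.
        return sum(alphas[k] * y[k] * K[k][i] for k in range(n)) + b

    passes = sweeps = 0
    while passes < max_passes and sweeps < max_sweeps:
        sweeps += 1
        changed = 0
        for i in range(n):
            E_i = f(i) - y[i]
            violates = (y[i] * E_i < -tol and alphas[i] < C) or \
                       (y[i] * E_i > tol and alphas[i] > 0)
            if not violates:
                continue
            j = rng.choice([k for k in range(n) if k != i])  # random partner
            E_j = f(j) - y[j]
            a_i, a_j = alphas[i], alphas[j]
            if y[i] != y[j]:
                L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            if L == H:
                continue
            eta = 2 * K[i][j] - K[i][i] - K[j][j]
            if eta >= 0:
                continue
            new_j = min(H, max(L, a_j - y[j] * (E_i - E_j) / eta))
            if abs(new_j - a_j) < 1e-5:
                continue  # negligible step; leave both multipliers unchanged
            new_i = a_i + y[i] * y[j] * (a_j - new_j)
            b1 = b - E_i - y[i] * (new_i - a_i) * K[i][i] - y[j] * (new_j - a_j) * K[i][j]
            b2 = b - E_j - y[i] * (new_i - a_i) * K[i][j] - y[j] * (new_j - a_j) * K[j][j]
            alphas[i], alphas[j] = new_i, new_j
            if 0 < new_i < C:
                b = b1
            elif 0 < new_j < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            changed += 1
        # Stop after max_passes consecutive sweeps with no update.
        passes = passes + 1 if changed == 0 else 0
    w = [sum(alphas[k] * y[k] * X[k][d] for k in range(n)) for d in range(len(X[0]))]
    return w, b


def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1


def cross_validate(X, y, k=10, seed=0):
    """Return the fraction of held-out points classified correctly over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w, b = train_smo([X[i] for i in train], [y[i] for i in train])
        correct += sum(1 for i in fold if predict(w, b, X[i]) == y[i])
    return correct / len(X)


def make_toy_data(n_per_class=20, seed=1):
    """Two well-separated clusters standing in for document feature vectors."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            X.append([center + rng.uniform(-0.5, 0.5), center + rng.uniform(-0.5, 0.5)])
            y.append(label)
    return X, y


X, y = make_toy_data()
print("10-fold accuracy:", cross_validate(X, y, k=10))
```

With real documents, each point would be replaced by a feature vector computed from the text, and a five-way decision would need a multi-class scheme such as one-vs-one voting over pairs of newspapers. Both of those are assumptions about how this sketch could be extended, not a description of the actual system.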

