Text Mining a Document’s Classification

March 23, 2012
jenholm
Data Science Blog

I have a large set of documents in a small number of formats (HTML, PDF), and I have to analyze their content and try to classify them in some sort of order so that I can find what I’m looking for. What does this describe?

The Internet
A researcher’s electronic library in a directory
99% of computer users’ hard drives

If you answered all three, then welcome to the wonderful world of search engines. This is obviously a big-ticket item, and hard to do. If you’re successful, you’ll probably be riding around in your own Gulfstream jet. Google looks for pages that are popular and lets the human resources of the Internet classify their pages for it, in effect subcontracting the classification scheme out to the human race. In lieu of having several billion people classify the docs on my desktop, what can I do to analyze and classify my docs? A couple of things:

1. Some of the docs are electronic books I bought, so I can actually use their ISBNs to look them up and classify them here: http://isbndb.com/
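Before an ISBN can be looked up anywhere, it has to be pulled out of the document text. Here is a minimal sketch of that step, assuming the text has already been extracted from the HTML or PDF file. The names `find_isbns`, `is_valid_isbn10`, and `is_valid_isbn13` are mine, made up for illustration, and the lookup itself is left out because I am not going to guess at the isbndb.com interface. The check-digit rules are the standard ISBN-10 and ISBN-13 ones.

```python
import re

# Candidate pattern: an optional 978/979 prefix, nine digits, and a final
# digit or X, with optional hyphens or spaces between the characters.
ISBN_CANDIDATE = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn10(s):
    """Check an ISBN-10: weights 10 down to 1, sum divisible by 11."""
    if len(s) != 10 or not s[:9].isdigit() or s[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """Check an ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if len(s) != 13 or not s.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return the distinct, checksum-valid ISBNs in text, in order found."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group()).upper()
        if is_valid_isbn10(digits) or is_valid_isbn13(digits):
            if digits not in found:
                found.append(digits)
    return found
```

Validating the check digit matters because plenty of ten-digit runs in a document (phone numbers, order numbers) are not ISBNs. A call like `find_isbns(page_text)` on the copyright page of an e-book is usually enough, since that is where the ISBN tends to live.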

2. Manually classify them by looking at their subject matter and using an existing guide, such as a library classification system. Since books are already classified that way, it might be a semi-decent place to start, but only librarians really use such systems, and they have limitations. There is also the Amazon classification system. I was unable to find out the exact method to their madness, but the first level of the hierarchy under Books is:

4-for-3 Books
Arts & Photography
Bargain Books
Biographies & Memoirs
Business & Investing
Calendars
Children’s Books
Christian Books & Bibles
Comics & Graphic Novels
Computers & Technology
Cookbooks, Food & Wine
Crafts, Hobbies & Home
Education & Reference
Gay & Lesbian
Health, Fitness & Dieting
History
Humor & Entertainment
Large Print
Law
Literature & Fiction
Medical Books
Mystery, Thriller & Suspense
Parenting & Relationships
Politics & Social Sciences
Professional & Technical
Religion & Spirituality
Romance
Science & Math
Science Fiction & Fantasy
Self-Help
Sports & Outdoors
Teens
Travel

This hierarchy looks like it is driven by sales, and it changes often. So we are presented with few viable options. What remains is to attempt to build a classification system computationally, deriving each document’s description from a computational-linguistics analysis of its text. Many have tried, many have failed. May fortune favor the foolish.
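To make the computational route a little more concrete, here is one common starting point, offered as a rough sketch and not as the answer: score each word by TF-IDF (how often it appears in a document, discounted by how many documents contain it) and treat the top-scoring words as candidate labels. The function names, the tiny stopword list, and the choice of TF-IDF are all my assumptions for illustration. It uses only the Python standard library and assumes the text has already been extracted from each file.

```python
import math
import re
from collections import Counter

# Deliberately tiny, illustrative stopword list; a real one would be longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_terms(docs, k=3):
    """Map each document name to its k highest-scoring TF-IDF terms.

    docs is a dict of {name: extracted text}.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    result = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values()) or 1
        scores = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in term_counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result
```

Words that show up in every document score zero, so they drop to the bottom of the ranking, and what is left hints at what makes each document different from its neighbors. Turning those hints into an actual category scheme is the hard part, and this sketch does not pretend to solve it.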

Enholm Heuristics
