I have a large set of documents in a small number of formats (HTML, PDF), and I have to analyze their content and classify them in some sort of order so that I can find what I’m looking for. What does this describe?
- The Internet
- A researcher’s electronic library in a directory
- 99% of computer users’ hard drives
If you answered all three, then welcome to the wonderful world of search engines. This is obviously a big-ticket item, and hard to do. If you’re successful, you’ll probably be riding around in your own Gulfstream jet. Google looks for pages that are popular and lets the human resources of the Internet classify their pages for it, in effect subcontracting the classification scheme out to the human race. In lieu of having several billion people classify the docs on my desktop, what can I do to analyze and classify my docs? A couple of things:
1. Some of the docs are electronic books I bought, so I can actually use their ISBNs to look them up and classify them here: http://isbndb.com/
2. Manually classify them by looking at their subject matter and using a guide. Since we already classify books by that system, that might be a semi-decent way to start, but only librarians use it and it has limitations. There is also the Amazon classification system. I was unable to find out the exact method to their madness, but the first level of the hierarchy under Books is:
4-for-3 Books
Arts & Photography
Bargain Books
Biographies & Memoirs
Business & Investing
Calendars
Children’s Books
Christian Books & Bibles
Comics & Graphic Novels
Computers & Technology
Cookbooks, Food & Wine
Crafts, Hobbies & Home
Education & Reference
Gay & Lesbian
Health, Fitness & Dieting
History
Humor & Entertainment
Large Print
Law
Literature & Fiction
Medical Books
Mystery, Thriller & Suspense
Parenting & Relationships
Politics & Social Sciences
Professional & Technical
Religion & Spirituality
Romance
Science & Math
Science Fiction & Fantasy
Self-Help
Sports & Outdoors
Teens
Travel
This looks like it is driven by sales, and it changes often. So we are left with few viable options. You can attempt to build a classification system with computational linguistics, deriving a description of each document from an analysis of its own text. Many have tried, many have failed. May fortune favor the foolish.
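To make that last option a little more concrete, here is a minimal sketch of one common starting point, TF-IDF term weighting, which scores a word highly when it is frequent in one document but rare across the rest of the collection. This is only one of many possible techniques, not a recommendation of the "right" one. It assumes the text has already been extracted from the HTML and PDF files (not shown here), and the names `tokenize`, `top_terms`, and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def top_terms(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each document.

    docs maps a document name to its already-extracted plain text
    (extraction from HTML/PDF is assumed to have happened elsewhere).
    """
    tokens = {name: tokenize(text) for name, text in docs.items()}

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    n_docs = len(docs)
    result = {}
    for name, words in tokens.items():
        tf = Counter(words)
        # Term frequency times inverse document frequency.
        scores = {
            term: (count / len(words)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


if __name__ == "__main__":
    # Hypothetical sample texts standing in for real extracted documents.
    sample = {
        "a.html": "python code python compiler code syntax",
        "b.pdf": "recipe flour sugar recipe oven",
        "c.pdf": "python snake reptile habitat snake",
    }
    for name, terms in top_terms(sample, k=2).items():
        print(name, terms)
```

The top terms for each document are only candidate labels. Mapping them onto a fixed hierarchy like the one above is the hard part, and this sketch does not attempt it.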