What business of mine is the future? No doubt Seldon has foreseen it and prepared against it. There will be other crises in the time to come when money power has become as dead a force as religion is now. Let my successors solve those new problems, as I have solved the one of today. - Foundation (Isaac Asimov)

ProjectGutenberg-NLP

This project studies the use of different machine learning tools for book classification by genre and topic. Final project of SD201 - Mining of Large Datasets - Télécom Paris.

Using a collection of more than 30000 English books from the Project Gutenberg free digital library, we tried different machine learning techniques such as decision trees, SVM and neural networks to create an algorithm that can give an appropriate subject label to a text. To extract features for the text, we used different NLP techniques such as TF-IDF and Word2Vec.

Below you can find a report about the algorithm (Explanation, experimental analysis and performance reports) and its source code. This was the final project of SD201 - Mining of Large Datasets at Télécom Paris I've made along 3 other colleagues. You can find its source code on Github.

This project was a great opportunity to gather experience in:

  • Machine Learning.
  • Data Fetching and processing of raw data
  • Data mining.
  • Feature extraction.
  • Scientific writing.


Created on the 20th of March 2023. Last edition on 26/3/2023