Mathematics, rightly viewed, possesses not only truth, but supreme beauty: a beauty cold and austere, like that of sculpture, without appeal to any part of our weaker nature, without the gorgeous trappings of painting or music, yet sublimely pure, and capable of a stern perfection such as only the greatest art can show. - Bertrand Russell

ProjectGutenberg-NLP

This project studies the use of different machine learning tools for book classification by genre and topic. Final project of SD201 - Mining of Large Datasets - Télécom Paris.

Using a collection of more than 30,000 English books from the Project Gutenberg free digital library, we tried several machine learning techniques, such as decision trees, SVMs and neural networks, to build a classifier that assigns an appropriate subject label to a text. To extract features from the text, we used NLP techniques such as TF-IDF and Word2Vec.
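To give an idea of what TF-IDF feature extraction involves, here is a minimal sketch in plain Python. It is not the project's actual code: the function names `tokenize` and `tfidf`, the crude regex tokenizer, and the unsmoothed weighting tf × log(N / df) are all assumptions made for illustration, and the three short strings stand in for full book texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical, deliberately crude tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: relative term frequency times log(N / df),
    with no smoothing or normalization.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus (made up for this example, not taken from the dataset).
docs = [
    "The whale and the sea",
    "The garden and the rose",
    "The sea storm",
]
vectors = tfidf(docs)
```

A term that occurs in every document, such as "the" here, gets weight zero, while a term found in a single document, such as "whale", gets the largest weight for its frequency. That is the property that makes these vectors useful as input to a classifier such as a decision tree or an SVM.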

Below you can find a report on the algorithm (explanation, experimental analysis and performance results) and its source code. I completed this work with three colleagues as the final project of SD201 - Mining of Large Datasets at Télécom Paris. The source code is available on GitHub.

This project was a great opportunity to gather experience in:

  • Machine Learning.
  • Data fetching and processing of raw data.
  • Data mining.
  • Feature extraction.
  • Scientific writing.


Created on 20 March 2023. Last edited on 26 March 2023.