A theory is good when it is beautiful. - Henri Poincaré

ProjectGutenberg-NLP

This project compares machine learning methods for classifying books by genre and topic. It was the final project for SD201 (Mining of Large Datasets) at Télécom Paris.

Using a collection of more than 30,000 English-language books from the Project Gutenberg free digital library, we tried several machine learning techniques, including decision trees, SVMs and neural networks, to build a classifier that assigns an appropriate subject label to a text. To extract features from the text, we used NLP techniques such as TF-IDF and Word2Vec.
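To make the feature-extraction step concrete, here is a minimal sketch of TF-IDF weighting followed by a very simple classifier. It is not the project's code: the four training sentences, the two labels, and every function name are invented for illustration, and a nearest-centroid rule with cosine similarity stands in for the decision trees, SVMs and neural networks mentioned above so that the example needs only the Python standard library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def fit_idf(token_lists):
    """Compute a smoothed inverse document frequency for each term in the corpus."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Return an L2-normalised sparse TF-IDF vector (term -> weight) for one document."""
    counts = Counter(t for t in tokens if t in idf)  # terms unseen in training are ignored
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def fit_centroids(vectors, labels):
    """Sum the TF-IDF vectors of each label into one centroid vector per label."""
    centroids = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            centroids[label][term] += weight
    return centroids


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(text, idf, centroids):
    """Return the label whose centroid is most similar to the text's TF-IDF vector."""
    vec = tfidf_vector(tokenize(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy corpus; the real project used full Project Gutenberg books.
train_texts = [
    "the detective examined the murder weapon and questioned the suspect",
    "the inspector found a clue near the body of the victim",
    "the starship left orbit and the crew saw an alien planet",
    "robots on the space station repaired the damaged rocket engine",
]
train_labels = [
    "Detective fiction",
    "Detective fiction",
    "Science fiction",
    "Science fiction",
]

token_lists = [tokenize(t) for t in train_texts]
idf = fit_idf(token_lists)
centroids = fit_centroids([tfidf_vector(t, idf) for t in token_lists], train_labels)

print(predict("the alien robots attacked the starship", idf, centroids))
```

The IDF term is what lets a rare word such as "starship" outweigh a word like "the" that appears in every document. On a real corpus, a library vectorizer and a stronger classifier would replace these hand-written pieces.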

Below you can find a report on the algorithm (explanation, experimental analysis and performance results) together with its source code, which is also available on GitHub. I completed the project with three colleagues.

This project was a great opportunity to gather experience in:

  • Machine learning
  • Fetching and processing raw data
  • Data mining
  • Feature extraction
  • Scientific writing


Created on 20 March 2023. Last edited on 26 March 2023.