Paper Review TO-DO

Created: 2022-08-08 09:05
#to-do

Goals

Be on the same page with Maryam about the approaches used and the metrics used to evaluate the approaches.

Action items

  • WEClustering with sentence transformer and HDBSCAN: BERT with sentence transformer -> I think that the method used by Maryam is more similar to BERTopic than to Machine Learning/Unsupervised/Clustering/WEClustering -> check this
  • Evaluation metrics:
    • PMI
    • inter/intra cluster
    • topic coherence (paper)
    • topic diversity and predictive accuracy (paper)
  • ALBERT and XLNet
  • Other approaches:
  • Test Top2Vec with deep-learn
  • Add section about tests on Top2Vec with several embedding models and chunking, with graphic comparison
  • update results table
  • Test n-gram vs no n-gram
  • divide results into qualitative and quantitative results
  • improve plots
  • write sections on paper
  • datasets statistics
  • Study IRBO formula
  • tests on raw data:
    • tourpedia
      • bert
      • roberta
      • sentence-tr
      • top2vec
    • easytour:
      • bert
      • roberta
      • sentence-tr
      • top2vec
  • add embedding coherence
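
Sketch for the inter/intra cluster metric listed above. The notes do not fix a definition, so this assumes one common choice: intra = mean distance of each point to its own cluster centroid, inter = mean distance between pairs of centroids. The function names and the toy vectors are made up for illustration; real input would be the document embeddings grouped by cluster label.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]


def intra_inter_distances(clusters):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    clusters: dict mapping a cluster label to a list of vectors.
    Assumed definitions (not taken from the notes):
      intra = mean Euclidean distance from each point to its own centroid
      inter = mean Euclidean distance between all pairs of centroids
    """
    centroids = {label: centroid(pts) for label, pts in clusters.items()}
    intra = [
        math.dist(p, centroids[label])
        for label, pts in clusters.items()
        for p in pts
    ]
    inter = [
        math.dist(centroids[a], centroids[b])
        for a, b in combinations(centroids, 2)
    ]
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy 2-D "embeddings" for two clusters (hypothetical data).
toy_clusters = {
    "a": [[0.0, 0.0], [0.0, 2.0]],
    "b": [[10.0, 0.0], [10.0, 2.0]],
}
intra, inter = intra_inter_distances(toy_clusters)
```

Lower intra and higher inter suggest tighter, better separated clusters; the ratio intra/inter can be reported as a single number if needed.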

For topic coherence -> [gensim](Hyperparameters tuning — Topic Coherence and LSI model | by Eleonora Fontana | Betacom | Medium)
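
Minimal sketch of what an NPMI-style topic coherence computes, in plain Python so the formula is visible. This is not the gensim implementation referenced above: it assumes document-level co-occurrence (a word pair co-occurs if both words appear in the same document) and averages NPMI over all word pairs of a topic. The function name and the toy corpus are made up for illustration.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of one topic.

    topic_words: list of the topic's top words.
    documents: list of tokenized documents (lists of words).
    Probabilities are estimated as document frequencies (assumption).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # convention: the pair never co-occurs
        elif p12 == 1:
            scores.append(1.0)   # convention: avoids 0/0 when both words are in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus (hypothetical data).
toy_docs = [
    ["hotel", "room", "clean"],
    ["hotel", "room", "staff"],
    ["pizza", "pasta", "wine"],
    ["pizza", "pasta", "dessert"],
]
coherent = npmi_coherence(["hotel", "room"], toy_docs)
incoherent = npmi_coherence(["hotel", "pizza"], toy_docs)
```

Scores range from -1 (words never co-occur) to 1 (words always co-occur); a topic whose top words appear together in the same documents scores higher.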

We should not preprocess data before feeding it to BERT models:

Evaluation parameters:

  • Calinski-Harabasz -> survey (1000+ citations), new survey. Most of the papers are about clustering, but some are about topic modeling (most of these have fewer than 30 citations and concern the medical field)
  • Davies-Bouldin -> number of topics optimization (26 citations), LDA. Similar to Calinski-Harabasz but less used
  • Perplexity -> it needs labels
  • Inter/intra cluster distance -> topic modeling for Twitter (32 citations); there are also other papers
  • Topic diversity now works properly and could be interesting for our paper
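
Sketch of topic diversity as it is usually defined: the fraction of unique words among the top-k words of all topics. The function name, the `top_k` default and the toy topics are assumptions for illustration, not the code used for the paper.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    topics: list of topics, each a list of words ordered by relevance.
    Returns a value in (0, 1]; 1.0 means no word is shared across topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Toy topics (hypothetical data): "hotel" is shared by both topics.
toy_topics = [
    ["hotel", "room", "staff"],
    ["hotel", "pizza", "wine"],
]
diversity = topic_diversity(toy_topics, top_k=3)
```

Values close to 1 indicate varied topics; values close to 0 indicate redundant topics that repeat the same words.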

Devised models: