Italian Word Embeddings

Human Language Technologies (HLT), Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo", Consiglio Nazionale delle Ricerche - Pisa, Italy


Overview

We generated word embeddings with two popular word representation models, word2vec and GloVe, both trained on the Italian Wikipedia.

Models

The word vectors trained with word2vec (skip-gram) are available here (1.5 GB), and the word vectors trained with GloVe are available here (790 MB).

The tar.gz archives contain pickled models that, once decompressed, can be loaded directly with the Gensim framework.

The pickled word2vec files include the entire model and can also be retrained with new data. The pickled GloVe files include only the word vectors.

Using the models

This sample code shows how to load and use the models.

>from gensim.models import Word2Vec

>model = Word2Vec.load('glove_WIKI') # glove model
OR
>model = Word2Vec.load('wiki_iter=5_algorithm=skipgram_window=10_size=300_neg-samples=10.m') # word2vec model

>for word in model.wv.vocab: # here you get the words
       print(word, model.wv.vocab[word]) # per-word frequency stats

>model.wv['sole'] # here you get the numpy embedding vectors
Out: 
array([ -1.86661184e-02,   1.31065890e-01,   3.69563736e-02,
        -6.03673719e-02,   6.20404482e-02,   5.64207993e-02,.......

Italian Word Analogy Questions

Once you have loaded a model, you can evaluate it on the Italian word analogy test, downloadable here.

References

word2vec

GloVe