Pre-trained models allow you to get up and running quickly, but you can likely improve performance on your dataset by fine tuning them. Normally, you'll bring your own data to the party, but for these examples we'll use datasets published on Hugging Face. Make sure you've installed the required data dependencies detailed in setup.
The Helsinki-NLP organization provides more than a thousand pre-trained models to translate between different language pairs. These can be further fine tuned on additional datasets with domain specific vocabulary. Researchers have also created large collections of documents that have been manually translated across languages by experts for training data.
The kde4 dataset contains many language pairs. Subsets can be loaded into your Postgres instance with a call to pgml.load_dataset, or you may wish to create your own fine tuning dataset with vocabulary specific to your domain.
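As a rough sketch of the loading step, the query below pulls the English/Spanish subset of kde4 and previews a few rows. The `kwargs` argument with `lang1`/`lang2` keys and the `pgml.kde4` table name are assumptions based on how the Hugging Face dataset selects a language pair; check both against your installation.

```sql
-- Assumption: kwargs is forwarded to the Hugging Face dataset loader,
-- where lang1/lang2 select the language pair to download.
SELECT pgml.load_dataset('kde4', kwargs => '{"lang1": "en", "lang2": "es"}');

-- Assumption: the loaded rows land in a table named after the dataset.
SELECT * FROM pgml.kde4 LIMIT 5;
```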
     id  | translation
    -----+-------------------------------------------------------------------------------
     99  | {"en":"If you wish to manipulate the DOM tree in any way you will have to use an external script to do so.","es":"Si desea manipular el árbol DOM deberá utilizar un script externo para hacerlo."}
     100 | {"en":"Credits","es":"Créditos"}
     101 | {"en":"The domtreeviewer plugin is Copyright & copy; 2001 The Kafka Team/ Andreas Schlapbach kde-kafka@master. kde. org schlpbch@unibe. ch","es":"Derechos de autor de la extensión domtreeviewer & copy;. 2001. El equipo de Kafka/ Andreas Schlapbach kde-kafka@master. kde. org schlpbch@unibe. ch."}
     102 | {"en":"Josef Weidendorfer Josef. Weidendorfer@gmx. de","es":"Josef Weidendorfer Josef. Weidendorfer@gmx. de"}
     103 | {"en":"ROLES_OF_TRANSLATORS","es":"Rafael Osuna rosuna@wol. es Traductor"}
    (5 rows)
When you're constructing your own datasets for translation, it's important to mirror the same table structure. You'll need a JSONB column named translation that has first a "from" language name/value pair, and then a "to" language name/value pair. In this English to Spanish example we use from "en" to "es". You'll pass a y_column_name of translation to tune the model.
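The following is a minimal sketch of that structure and of a tuning call, not a definitive recipe. The table name `my_translations`, the project name, the `task` value, the choice of `Helsinki-NLP/opus-mt-en-es` as the starting model, and the hyperparameter values are all illustrative assumptions. The hyperparameter keys are assumed to be passed through to the Hugging Face trainer.

```sql
-- Hypothetical table mirroring the kde4 layout: one JSONB column named translation.
CREATE TABLE my_translations (
    id BIGSERIAL PRIMARY KEY,
    translation JSONB NOT NULL
);

-- Each row holds the "from" text under "en" and the "to" text under "es".
INSERT INTO my_translations (translation)
VALUES ('{"en": "Credits", "es": "Créditos"}');

-- Assumed call shape: tune a pre-trained English to Spanish model on kde4.
-- Swap relation_name for your own table once it holds enough examples.
SELECT pgml.tune(
    'Translate English to Spanish',               -- hypothetical project name
    task => 'translation',
    relation_name => 'pgml.kde4',
    y_column_name => 'translation',
    model_name => 'Helsinki-NLP/opus-mt-en-es',
    hyperparams => '{
        "learning_rate": 2e-5,
        "num_train_epochs": 1,
        "weight_decay": 0.01
    }'
);
```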
Translations use the pgml.generate API since they return TEXT rather than numeric values. You may also call pgml.generate with a TEXT[] for batch processing.
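A sketch of both call styles follows. It assumes a project named 'Translate English to Spanish' was created by an earlier pgml.tune call; the project name and the input strings are illustrative.

```sql
-- Single input: returns one TEXT translation.
SELECT pgml.generate('Translate English to Spanish', 'I love SQL') AS spanish;

-- Batch input: pass a TEXT[] to translate several strings in one call.
SELECT pgml.generate(
    'Translate English to Spanish',
    ARRAY['I love SQL', 'Credits']::TEXT[]
) AS spanish;
```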
DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It can be fine tuned on specific datasets to learn further nuance between positive and negative examples. For this example, we'll fine tune distilbert-base-uncased on the IMDB dataset, which is a list of movie reviews along with a positive or negative label.
Without tuning, DistilBERT classifies every single movie review as positive and has an F1 score of 0.367, which is about what you'd expect for a relatively useless classifier. However, after training for a single epoch (about 10 minutes on an Nvidia 1080 Ti), the F1 jumps to 0.928, a huge improvement indicating that DistilBERT can now predict sentiment from IMDB reviews fairly accurately. Training for another epoch yields only a very minor improvement to 0.931, and the third epoch is flat, also at 0.931, which indicates DistilBERT is unlikely to learn more about this particular dataset with additional training. You can view the results of each model, like those trained from scratch, in the dashboard.
Once our model has been fine tuned on the dataset, it'll be saved and deployed with a Project visible in the Dashboard, just like models built from simpler algorithms.
The IMDB dataset has 50,000 examples of user reviews with positive or negative viewing experiences as the labels, and is split 50/50 into training and evaluation datasets.
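Loading and inspecting the data might look like the sketch below. The table name `pgml.imdb` and the `label` column are assumptions based on the layout of the Hugging Face dataset.

```sql
-- Load the IMDB reviews from Hugging Face into Postgres.
SELECT pgml.load_dataset('imdb');

-- Assumption: rows land in pgml.imdb with a numeric label column.
-- Counting rows per label is a quick check of the class balance.
SELECT label, count(*) AS reviews
FROM pgml.imdb
GROUP BY label
ORDER BY label;
```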
Tuning has a nearly identical API to training, except you may pass the name of a model published on Hugging Face to start with, rather than training an algorithm from scratch.
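A hedged sketch of such a tuning call is shown here. The parameter names are assumed to mirror the training API, the hyperparameter values are illustrative rather than recommended, and `test_sampling => 'last'` assumes the evaluation half of the data is stored after the training half.

```sql
SELECT pgml.tune(
    'IMDB Review Sentiment',                  -- project name
    task => 'text-classification',            -- assumed task identifier
    relation_name => 'pgml.imdb',             -- assumed table created by the loader
    y_column_name => 'label',
    model_name => 'distilbert-base-uncased',  -- Hugging Face model to start from
    hyperparams => '{
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 1,
        "weight_decay": 0.01
    }',
    test_size => 0.5,                         -- matches the 50/50 split of the dataset
    test_sampling => 'last'
);
```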
The default for predict in a classification problem classifies the statement as one of the labels. In this case, 0 is negative and 1 is positive. If you'd like to check the individual probabilities associated with each class you can use the predict_proba API:
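A minimal sketch, assuming predict_proba takes the same project name and text arguments as predict:

```sql
-- Assumed to return one probability per class, in label order
-- (negative first, then positive).
SELECT pgml.predict_proba('IMDB Review Sentiment', 'I love SQL') AS sentiment;
```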
At a high level, summarization uses similar techniques to translation. Both use an input sequence to generate an output sequence. The difference is that summarization extracts the most relevant parts of the input sequence to generate the output.
BillSum is a dataset with training examples that summarize US Congressional and California state bills. You can pass kwargs specific to loading datasets; in this case we'll restrict the dataset to California samples:
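One way this could look is sketched below. The `split` key and the `ca_test` value are assumptions taken from the split names of the Hugging Face dataset, and `pgml.billsum` is the assumed name of the resulting table.

```sql
-- Assumption: kwargs is forwarded to the Hugging Face loader, and the
-- California bills are published as the ca_test split.
SELECT pgml.load_dataset('billsum', kwargs => '{"split": "ca_test"}');

-- Preview one row to see the available fields.
SELECT * FROM pgml.billsum LIMIT 1;
```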
This dataset has 3 fields, but summarization transformers only take a single input to produce their output. We can create a view that simply omits the title from the training data:
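A sketch of such a view follows. The view name is hypothetical, and the column names `text` and `summary` are assumed from the Hugging Face dataset.

```sql
-- Keep only the model input (text) and the target (summary); title is omitted.
CREATE OR REPLACE VIEW billsum_training_data AS
SELECT text, summary
FROM pgml.billsum;
```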
Or, it might be interesting to concatenate the title to the text field to see how relevant it actually is to the bill. If the title of a bill is the first sentence, and doesn't appear in the summary, it may indicate that it's a poorly chosen title for the bill:
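The concatenated variant might be sketched as follows. The view name is hypothetical, and the column names are assumed from the Hugging Face dataset.

```sql
-- Prepend the title as the first line of the input text.
-- E'\n' is a Postgres escape string that produces a real newline.
CREATE OR REPLACE VIEW billsum_titled_training_data AS
SELECT title || E'\n' || text AS text, summary
FROM pgml.billsum;
```

Tuning once on each view and comparing the evaluation metrics would show whether the title adds useful signal.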
After tuning, the sentiment model can classify new text with pgml.predict. Here the result of 1 is the positive label:

```sql
SELECT pgml.predict('IMDB Review Sentiment', 'I love SQL') AS sentiment;
```

```
 sentiment
-----------
         1
(1 row)

Time: 16.681 ms
```