# Fun with Markov Chains

The subject: https://codingdojo.org/kata/MarkovChain/

Because this is as much fun as Large Language Models, but does not need huge sets of stolen texts and centillions of GPUs to run.

## Building the chains

The `filter.py` script must first be used to extract the word probabilities, given the previous word. It behaves like a Unix filter: it reads the text to analyse on its standard input and prints the statistics on its standard output.

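The statistics format produced by `filter.py` is not specified here, but the idea can be sketched as follows: count, for each word, how often each word follows it, then normalise the counts into probabilities. The `build_stats` name and the JSON output are assumptions for illustration, not the program's actual behaviour.

```python
import collections
import json
import sys

def build_stats(text):
    """Map each word to the probabilities of the words that follow it."""
    words = text.split()
    counts = collections.defaultdict(collections.Counter)
    for prev, cur in zip(words, words[1:]):
        counts[prev][cur] += 1
    # Normalise the raw counts into probabilities.
    return {
        prev: {word: n / sum(followers.values()) for word, n in followers.items()}
        for prev, followers in counts.items()
    }

if __name__ == "__main__":
    # Behave like a Unix filter: text in, statistics out.
    json.dump(build_stats(sys.stdin.read()), sys.stdout)
```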
Texts used for training should be quite clean, so it can be worthwhile to combine `filter.py` with other filters to get the most effective statistics. As an example, the `gen.sh` script performs the following steps.

- Removing all carriage returns and line feeds.
- Converting all characters to lower case.
- Adding spaces around commas and periods, so that these punctuation symbols are treated as words.
- Running the filter and saving the result to a file.

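The cleanup steps above can be sketched in Python as a single function. This `clean` helper is a hypothetical illustration, not the actual content of `gen.sh`, which chains Unix filters instead.

```python
import re

def clean(text):
    """Normalise raw text before feeding it to the statistics filter."""
    # Remove all carriage returns and line feeds.
    text = text.replace("\r", " ").replace("\n", " ")
    # Set all the characters in lower case.
    text = text.lower()
    # Add spaces around commas and periods so they become words.
    return re.sub(r"([,.])", r" \1 ", text)
```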
## Generating surrealistic sequences of words

The `markov.py` program understands the following arguments.

- `-f` to provide the path to a statistics file. This argument is mandatory, of course.
- `-w` to provide a first word. If none is provided, the first word is chosen randomly.
- `-n`, the number of words to display (including the given one, if any).

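Assuming the statistics map each word to a dictionary of follower probabilities, the generation loop could look roughly like this sketch; `generate` is a hypothetical name, not the actual code of `markov.py`.

```python
import random

def generate(stats, first=None, count=10):
    """Walk the chain, picking each next word according to the
    probabilities recorded for the previous word."""
    word = first if first is not None else random.choice(list(stats))
    words = [word]
    while len(words) < count:
        followers = stats.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = random.choices(list(followers), weights=followers.values())[0]
        words.append(word)
    return " ".join(words)
```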
## Notes for testing

The Python programs rely on the `args` kata, available in a sibling directory, so `PYTHONPATH=../args` should be used to run the programs in place.

A `poe.txt` file is provided as a sample. It contains excerpts from Tales of the Grotesque and Arabesque by Edgar Allan Poe, which are of course in the public domain.

```shell
> ./gen.sh poe.txt
> PYTHONPATH=../args ./markov.py -f poe.txt.stats -w le
le pavé de moi , puisqu’elle était un peu près semblable perfection dans les cas qui semblaient n’éprouver aucune donnée pour de leur cœur ⏎
```

## Interesting future paths

The generated statistics file is a large dictionary of word/probability pairs, indexed by the previous word.

By default, each word is thus written once as a key, and then as many times as it is found elsewhere in the text, each time with a probability of occurrence. It would be more efficient to associate short integer identifiers with the words, and to use those identifiers wherever a word occurs.

This is what the `filter.py` program does when it is given the `-t` argument. However:

- the `markov.py` program does not understand this format yet.
- the result is not very convincing yet, because the statistics are stored as JSON (an easy option to begin with), and this format does not store integers in a concise way.