# Fun with Markov Chains
The subject: https://codingdojo.org/kata/MarkovChain/
Because this is as fun as Large Language Models, but does not need large sets of stolen texts and centillions of GPUs to run.
## Building the chains
The `filter.py` program must first be used to extract the word probabilities, given the previous word. It behaves like a Unix filter, reading the text to analyse on its standard input and printing the statistics on its standard output.
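For illustration, here is a minimal sketch of what such a filter could look like. It is not the actual `filter.py`, and the output layout (a JSON dictionary mapping each previous word to the probabilities of its followers) is an assumption.

```python
#!/usr/bin/env python3
"""Sketch of a statistics filter in the spirit of filter.py (illustrative)."""
import json
import sys
from collections import Counter, defaultdict

counts = defaultdict(Counter)
words = sys.stdin.read().split()
# Count how often each word follows each previous word.
for previous, current in zip(words, words[1:]):
    counts[previous][current] += 1

# Turn the raw counts into probabilities, indexed by the previous word.
stats = {}
for previous, followers in counts.items():
    total = sum(followers.values())
    stats[previous] = {word: n / total for word, n in followers.items()}

json.dump(stats, sys.stdout)
```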
Training texts should be quite clean, and it can be worthwhile to combine `filter.py` with other filters to get the most useful statistics. As an example, the `gen.sh` script proceeds with the following steps; a Python equivalent is sketched after the list.
- Remove all carriage returns and line feeds.
- Convert every character to lower case.
- Add spaces around commas and periods, so that those punctuation symbols are treated as words.
- Run the filter and save the result to a file.
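A Python equivalent of the cleaning steps could look like this; the real `gen.sh` is a shell script whose exact commands are not shown here, so this sketch is an assumption.

```python
#!/usr/bin/env python3
"""Sketch of the cleaning steps of gen.sh, in Python (illustrative only)."""
import re
import sys

text = sys.stdin.read()
# Remove all carriage returns and line feeds.
text = text.replace("\r", " ").replace("\n", " ")
# Convert every character to lower case.
text = text.lower()
# Add spaces around commas and periods so they become words of their own.
text = re.sub(r"([,.])", r" \1 ", text)
sys.stdout.write(text)
```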
## Generating surrealistic sequences of words
The `markov.py` program understands the following arguments.

- `-f` to provide the path to a statistics file. This is mandatory, of course.
- `-w` to provide a first word. If not provided, the first word is chosen randomly.
- `-n` to provide the number of words to display (including the given one, if any).
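As an illustration of what the generation loop might amount to, here is a minimal sketch. It is not the actual `markov.py`; the statistics layout (previous word mapped to follower probabilities), the dead-end handling, and the file name are assumptions.

```python
#!/usr/bin/env python3
"""Sketch of a Markov text generator (not the real markov.py)."""
import json
import random

def generate(stats, first=None, count=20):
    # Pick a random starting word when none is given (the -w case).
    word = first if first is not None else random.choice(list(stats))
    output = [word]
    while len(output) < count:
        followers = stats.get(word, {})
        if not followers:
            # Dead end: restart from a random word.
            word = random.choice(list(stats))
        else:
            # Draw the next word according to the recorded probabilities.
            word = random.choices(list(followers),
                                  weights=list(followers.values()))[0]
        output.append(word)
    return output

with open("poe.txt.stats") as f:  # path as passed with -f
    stats = json.load(f)
print(" ".join(generate(stats, first="le", count=20)))
```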
## Notes for testing
The Python programs rely on the `args` kata, available in a sibling directory, so `PYTHONPATH=../args` should be used to run the programs in place.
A `poe.txt` file is provided as a sample. It contains excerpts of *Tales of the Grotesque and Arabesque* by Edgar Allan Poe, which are of course in the public domain.
```
> ./gen.sh poe.txt
> PYTHONPATH=../args ./markov.py -f poe.txt.stats -w le
le pavé de moi , puisqu’elle était un peu près semblable perfection dans les cas qui semblaient n’éprouver aucune donnée pour de leur cœur
```
## Interesting future paths
The generated statistics file is a large dictionary of word/probability pairs, indexed by the previous word.
By default, each word is thus written once as a key, and again every time it appears as a follower elsewhere in the text, together with its probability of occurrence. It should be more efficient to associate short identifiers (integers) with the words, and to use those identifiers anywhere a word is found.
This is what the `filter.py` program does when it is given the `-t` argument. However:

- the `markov.py` program does not understand this format yet;
- the result is not very convincing yet, because the statistics are stored as JSON (an easy option to begin with), and that format does not store integers in a concise way.
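For illustration, the two layouts could look as follows. The exact shapes are assumptions, not the documented formats, and the probability values are made up.

```python
# Assumed default layout: each previous word maps to the probabilities
# of the words that can follow it, so every word is repeated as a string.
stats = {
    "le": {"pavé": 0.125, "cœur": 0.0625, ",": 0.25},
}

# Assumed identifier-based layout (-t): each word is stored once in a
# table, and the transitions only carry small integer identifiers.
# Note that JSON forces dictionary keys to strings, which is one reason
# the format does not store integers concisely.
tokenized = {
    "words": ["le", "pavé", "cœur", ","],
    "stats": {0: {1: 0.125, 2: 0.0625, 3: 0.25}},
}
```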