author     Botond Hende <nettingman@gmail.com>  2025-01-29 23:54:57 +0100
committer  Botond Hende <nettingman@gmail.com>  2025-01-29 23:54:57 +0100
commit     045371fdbf3a757c890af9b57116c86b44a2f701 (patch)
tree       bb619a52d81f6d36ba1c33479bc5eb7f78e89666
parent     57bc64eb17a8b7494c75c04cade43c96fe39a77f (diff)

hestia content (HEAD, master)

-rw-r--r--  posts/hestia/content.md    101
-rw-r--r--  posts/hestia/hestia.webp   bin 0 -> 90082 bytes
-rw-r--r--  posts/hestia/meta.json     10
3 files changed, 111 insertions, 0 deletions
diff --git a/posts/hestia/content.md b/posts/hestia/content.md
new file mode 100644
index 0000000..c592078
--- /dev/null
+++ b/posts/hestia/content.md
@@ -0,0 +1,101 @@
+So I've been working on this multipurpose custom voice assistant for a while now and I think it's time to write a blogpost about it.
+
+## A project years in the making (technically)
+I started this project last summer, but after getting through the most interesting/hardest part (or so I thought), I quickly gave up on it.
+
+The initial idea was to create a Siri/Cortana/Google Assistant style **voice assistant** for my PC. I wanted to use voice control for basic tasks like opening apps and locking the screen.
+
+The most interesting part for me was the **voice interpretation**, as I had never done anything similar before. So I started by looking into all kinds of speech-to-text solutions. I found a few open-source projects, but by that time most of them had been discontinued in favor of OpenAI's [Whisper](https://github.com/openai/whisper). So I downloaded Whisper and started experimenting.
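+
+Getting a first transcription out of Whisper only takes a few lines of Python. Here is a minimal sketch of what my early experiments looked like (the model size and file name are just placeholder examples):
+
+<div class="code-block"><p>import whisper</p>
+<br>
+<p># load the model once; "base" is just an example size
+model = whisper.load_model("base")</p>
+<br>
+<p># transcribe a recorded audio file and print the recognized text
+result = model.transcribe("recording.wav")
+print(result["text"].strip())</p></div>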
+
+For starters I wanted to make a straightforward version, where I could **activate** the "listening mode" **with a button press**, and it kept listening **until I pressed the button again**. So I made a shortcut in Sway (my window manager) to run a shell script that looked something like this (this is actually a newer version):
+
+<div class="code-block"><p><span class="blue">#!/bin/bash -eu</span></p>
+<br>
+<p><span class="cyan">FIFO_PATH</span>=<span class="yellow">"</span><span class="green">/tmp/hestia-listening</span><span class="yellow">"</span></p>
+<br>
+<p><span class="yellow">if</span> <span class="red">[[</span> <span class="yellow">-p "</span><span class="magenta">$FIFO_PATH</span><span class="yellow">"</span> <span class="red">]]</span><span class="yellow">; then</span>
+ <span class="yellow">echo >> "</span><span class="magenta">$FIFO_PATH</span><span class="yellow">"</span>
+<span class="yellow">else</span>
+ <span class="cyan">SCRIPT_DIR</span>=<span class="magenta">$(</span> <span class="yellow">cd</span> <span class="red">--</span> <span class="yellow">"</span><span class="magenta">$(</span> <span class="red">dirname --</span> <span class="yellow">"</span><span class="magenta">${BASH_SOURCE[</span><span class="green">0</span><span class="magenta">]}</span><span class="yellow">"</span> <span class="magenta">)</span><span class="yellow">" &></span> <span class="red">/dev/null</span> <span class="yellow">&& pwd</span> <span class="magenta">)</span>
+ <span class="cyan">MODULE_NAME</span>=<span class="magenta">$(</span><span class="red">basename</span> <span class="yellow">"</span><span class="magenta">$SCRIPT_DIR</span><span class="yellow">"</span><span class="magenta">)</span>
+ <span class="yellow">cd "</span><span class="magenta">$SCRIPT_DIR</span><span class="yellow">"</span>
+ <span class="yellow">source</span> venv/bin/activate
+ <span class="yellow">cd</span> ..
+ python -m <span class="yellow">"</span><span class="magenta">$MODULE_NAME</span><span class="yellow">"</span>
+<span class="yellow">fi</span></p></div>
+
+The script is pretty simple: it looks for a **fifo (named pipe)** at a predefined path. If the fifo exists, the script **echoes a newline into it** (this is the signal for the voice assistant); **otherwise it starts the voice assistant** script. When the main script starts, it creates the fifo and waits in a loop, trying to **read from it**.
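+
+On the Python side, the waiting loop looks roughly like this (a simplified sketch; start_recording, stop_recording, transcribe and handle_command are hypothetical helpers, not the actual function names):
+
+<div class="code-block"><p>import os</p>
+<br>
+<p>FIFO_PATH = "/tmp/hestia-listening"  # the same path the shell script checks</p>
+<br>
+<p>if not os.path.exists(FIFO_PATH):
+    os.mkfifo(FIFO_PATH)</p>
+<br>
+<p>listening = False
+while True:
+    # opening the fifo for reading blocks until the shell script echoes into it
+    with open(FIFO_PATH) as fifo:
+        fifo.read()
+    listening = not listening
+    if listening:
+        start_recording()
+    else:
+        audio_path = stop_recording()
+        handle_command(transcribe(audio_path))</p></div>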
+
+Whenever the fifo had some content, the voice assistant script **started recording the microphone input** or **stopped it** if it was already recording. After this, the **temporary audio file was fed into Whisper**. With this I had a simple solution for getting the user's voice input. Unfortunately, this was where the magic stopped. I had no meaningful way of interpreting the text output of Whisper, other than matching it against some keywords. I made some proof-of-concept commands, like **locking the screen**, and tried them out. I also created some pre-generated, generic text-to-speech responses (like "At your service!" or "Welcome back!") to make it more human-like. Then I lost interest in the project.
+
+## Revival
+About a year later, while organizing the projects I wanted to write about on the blog, I came across this code again. Personal assistant tools (like ChatGPT's voice assistant mode) were also frequently talked about in the news, and I had been thinking about setting up [Home Assistant](https://www.home-assistant.io) at my place for a while. Eventually these all came together into a single idea:
+
+**"What if I made my own personal assistant?"**
+
+I started to brainstorm. And I have to say, I came up with a **LOT of ideas** and features. I'm not even going to list them all here, because I have little hope of ever implementing them all, but if I wanted to, I could write blog posts about this project alone for years, so that's a relief at least.
+
+The basic idea was the following: if I wanted to make this assistant useful, I had to solve the biggest issue I had in the original project first. So I did just that.
+
+## The most important part
+Buuuuut... of course I had to find a **new name** for the project first. I couldn't just call it "voice-assistant" like before, especially since I was planning to make this feel like an AI assistant from some sci-fi or cyberpunk story. I think I was somewhat influenced by [Hades II](https://en.wikipedia.org/wiki/Hades_II), the game I was playing at the time. Also, I really liked how all the AIs are named after Greek gods in ||spoiler1||Horizon: Zero Dawn||. So in the end I came up with **Hestia**, the Greek goddess of hearth and home (like a home assistant, get it?). For a while I was ruminating on calling it HestAI, but in the end I dropped the idea.
+
+## HassIL to the rescue
+So obviously I could have used some **large language model** to interpret the speech-to-text output, but that was exactly what I **wanted to avoid**: big overhead and/or running non-locally, non-determinism/an unknown random factor, and last but not least, I could hardly call it my own.
+
+I started to look into how natural language processing was done around 10 years ago, and how old chatbots worked. Luckily I didn't have to go too deep down the rabbit hole, because [Mike](https://mikesweb.site), a friend of mine, mentioned I should check out how they did it in Home Assistant.
+
+[HassIL](https://github.com/home-assistant/hassil) (Home Assistant Intent Language) is a separate module inside the Home Assistant project, and it does _exactly_ what I needed. It has a pretty simple YAML config layout for setting up insanely flexible expressions. HassIL can then use these expressions to interpret a text input, understand the **intent** of the user, and even extract some **intent-specific data** from the sentence.
+
+Here's an example from the **config** I set up for Hestia:
+
+<div class="code-block"><p><span class="red">language</span>: <span class="green">"en"</span></p>
+<p><span class="red">intents</span>:
+ <span class="red">HesExecuteProcess</span>:
+ <span class="red">data</span>:
+ - <span class="red">sentences</span>:
+ - <span class="green">"&lt;execute&gt; &lt;process&gt; [&lt;in_workspace&gt;]"</span>
+ <span class="red">slots</span>:
+ <span class="red">domain</span>: <span class="green">"process"</span></p>
+<br>
+<p><span class="red">expansion_rules</span>:
+ <span class="red">process</span>: <span class="green">"[the ]{process}"</span>
+ <span class="red">execute</span>: <span class="green">"(execute|open|run|start)"</span></p></div>
+
+HassIL has a pretty neat way of defining flexible sentences. The meanings of the different symbols are the following:
+* anything between **<>** is subject to an **expansion rule**: some (usually more complicated) expression defined in the same config, or in a global config file. Basically an alias.
+* **[]** marks an **optional** part of the sentence. It can be used for parts where you might or might not use articles (the, a, an), or where you might want to specify some extra info (in this example the workspace).
+* with **{}** you can set a **slot**, which is basically a variable/extra piece of info in the sentence with predefined values (for example, in Home Assistant all rooms, device names, etc. are provided to HassIL at the start).
+* you can specify **alternative words/phrases** for the same thing, all separated by **|**.
+* you can also define **permutations** of words/phrases with **;** (not used in this example).
+* when defining alternative words or permutations, you need to enclose them in **()** if they are non-optional (or **[]** if they are optional).
+
+Let's take a look at the <span class="code-block-wrap"><span class="green">&lt;execute&gt; &lt;process&gt; [&lt;in_workspace&gt;]</span></span> part, as this is the actual **sentence for the intent of opening an app**. As you can see, it starts with two expansion rules (both defined at the end of the config). The sentence has to start with one of the words "execute", "open", "run" or "start". This is followed by a process/program name (all available programs are provided to HassIL at the start), which can stand with the definite article "the" or on its own.
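+
+Putting it together, the interpretation step in Python looks roughly like this. This is only a sketch based on HassIL's documented usage: the exact import paths and signatures may differ between versions, and the program list here is made up:
+
+<div class="code-block"><p>import io</p>
+<br>
+<p>from hassil import Intents, recognize
+from hassil.intents import TextSlotList</p>
+<br>
+<p>CONFIG = """
+language: "en"
+intents:
+  HesExecuteProcess:
+    data:
+      - sentences:
+          - "&lt;execute&gt; &lt;process&gt;"
+        slots:
+          domain: "process"
+expansion_rules:
+  process: "[the ]{process}"
+  execute: "(execute|open|run|start)"
+"""</p>
+<br>
+<p>intents = Intents.from_yaml(io.StringIO(CONFIG))</p>
+<br>
+<p># the available programs are passed in as a slot list, similarly to how
+# Home Assistant passes in room and device names
+slot_lists = {"process": TextSlotList.from_strings(["firefox", "steam", "keepassxc"])}</p>
+<br>
+<p>result = recognize("open the firefox", intents, slot_lists=slot_lists)
+if result is not None:
+    print(result.intent.name)                # HesExecuteProcess
+    print(result.entities["process"].value)  # firefox</p></div>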
+
+The last part of the sentence is the workspace, which can optionally be specified. In Sway (the window manager I use) workspaces are similar to workspaces in GNOME, or virtual desktops in KDE and Windows. It's basically a group of windows you keep on "one screen", and you can switch between these screens by changing the workspace.
+
+By default, I use at most 10 workspaces in Sway, so in the global config the workspace expansion rule is set up the following way:
+* it has to be a number between 1 and 10
+* it can be either in the form of "workspace N" or "Nth workspace" (with the irregular ordinals like "first", "second" and "third" spelled out correctly)
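+
+A rough sketch of how such a rule could be defined (this is only an illustration using HassIL's range lists and in/out value mappings, not the exact config from the repo):
+
+<div class="code-block"><p>lists:
+  workspace:
+    range:
+      from: 1
+      to: 10
+  workspace_ordinal:
+    values:
+      - in: "first"
+        out: 1
+      - in: "second"
+        out: 2
+      # ...and so on, up to "tenth"</p>
+<br>
+<p>expansion_rules:
+  in_workspace: "(in|on) (workspace {workspace}|the {workspace_ordinal} workspace)"</p></div>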
+
+You can check this out in detail in the [repository](https://git.wazul.moe/hestia/tree/sentences/_common.yaml).
+
+## Breathing life into the computer
+Now that some basic commands were working, I wanted to concentrate on making this whole assistant a bit more **humanlike**.
+
+For now, I did not concentrate on the voice reply. In the previous version I used pre-generated text-to-speech voice files (by simply playing them). But the more I thought about it, the clearer it became that the assistant would be very limited if it could not react with dynamic data (like referring to something the user asked about, or reading a new post from an RSS feed). I have looked into some solutions for generating text-to-speech on the fly, but that will be a topic for another blog post.
+
+However, I wanted another part implemented as well. It was always part of my goal to have some sort of **UI popup with the face** of the AI assistant on the computer screen whenever it is talking. Of course with **different kinds of expressions** depending on the situation, possibly even changing mid-sentence, similarly to how **visual novels** work.
+
+I was already using [mako](https://github.com/emersion/mako), a lightweight Wayland notification daemon, which luckily had just the flexible configuration options I needed. This way, if I fire a **desktop notification** with [libnotify](https://github.com/GNOME/libnotify), it can display the content the way I want it.
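+
+Firing such a notification from Python is basically a thin wrapper around notify-send. A minimal sketch (the helper name and the icon path are made up for this example):
+
+<div class="code-block"><p>import subprocess</p>
+<br>
+<p>def show_face(message, expression="neutral"):
+    # fire a desktop notification through notify-send (libnotify)
+    subprocess.run(
+        [
+            "notify-send",
+            "--app-name=hestia",
+            "--icon=/path/to/faces/" + expression + ".png",
+            "Hestia",
+            message,
+        ],
+        check=True,
+    )</p>
+<br>
+<p>show_face("Starting it now!")</p></div>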
+
+I defined the colors, icon location and stuff like that in the mako config file under <span class="code-block-wrap magenta">[app-name=hestia]</span>, so my voice assistant's messages have a unique look compared to other system messages. So when you ask Hestia to start Firefox, this is the result (other than Firefox actually opening):
+
+![a small notification window displaying the text 'Starting it now!' and an image of the Giustiniani Hestia statue](hestia.webp "SPEAK TO ME MACHINE LIFEFORM!" https://en.wikipedia.org/wiki/Giustiniani_Hestia "It's weird how little classical artwork I have found of Hestia. I mean, she's not the most popular Greek goddess, but still.")
+
+Of course the statue is just a placeholder image until I have the time to **draw** (!!!) some artwork for the assistant myself. Expect some weeb anime-style stuff coming soon™.
+
+## What plans do the Fates have for Hestia?
+I'm planning to clean up some things in the immediate future, for example to **make a more flexible and extendable response system**. I've already started to make the whole project a bit **more modular**: for example, you can use text input instead of voice, which is a much more convenient way of testing on my laptop with its not-so-good microphone. I plan to do the same with the output, so in the end the assistant could run on different kinds of devices with different kinds of input and output methods (for example on a local server with Home Assistant capabilities, maybe controlled over chat messages, who knows). And of course I want to **extend the features** so I can start using it for actually helpful tasks. As I've said, I have tons of ideas.
+
+Now that I have more free time to work on this project, I will hopefully start posting more frequently. Until then, however, it's farewell. Thank you for reading!
diff --git a/posts/hestia/hestia.webp b/posts/hestia/hestia.webp
new file mode 100644
index 0000000..bc5a4d4
--- /dev/null
+++ b/posts/hestia/hestia.webp
Binary files differ
diff --git a/posts/hestia/meta.json b/posts/hestia/meta.json
new file mode 100644
index 0000000..8fc8bfa
--- /dev/null
+++ b/posts/hestia/meta.json
@@ -0,0 +1,10 @@
+{
+ "title" : "Hestia: the homemade home assistant",
+ "publish_date" : "2025-01-29",
+ "tags" : [
+ "hestia",
+ "voice_assistant",
+ "home_assistant",
+ "python"
+ ]
+}