XDG_STATE_HOME for the location of history data

akinomyoga · January 17, 2024, 1:17pm

This is just to mention XDG_STATE_HOME since it doesn’t seem to be mentioned anywhere in GitHub/Discord/Forum so far. This is not a request to use XDG_STATE_HOME but just a discussion.

We currently store the history database in $XDG_DATA_HOME, or its default ~/.local/share. Meanwhile, relatively recently (two and a half years ago), Freedesktop.org (formerly the X Desktop Group, aka XDG) introduced XDG_STATE_HOME in the XDG Base Directory Specification [1] as the location for application state, including logs and histories. The detailed background can be found in Refs. [2,3].

[1] Environment variables
[2] New XDG_STATE_HOME in XDG Base Directory Spec : r/linux
[3] XDG_STATE_HOME

The location XDG_DATA_HOME (~/.local/share) was originally intended as the equivalent of /usr/share for applications installed in the user’s directory. /usr/share stores the static data distributed by application packages, such as documentation, images, libraries, etc. However, since there was no designated location for user data, applications started to use ~/.local/share for the secondary purpose of storing it. XDG_STATE_HOME was then introduced to fill that gap.
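For reference, here is a minimal sketch of how an application might resolve these directories. It assumes the rules in the spec as I read them: an unset or empty variable falls back to the default under $HOME, and a relative path is treated as invalid and ignored. The helper name `xdg_dir` and its `env`/`home` parameters are made up for illustration and are not part of any library.

```python
import os
from pathlib import Path

# Defaults from the XDG Base Directory Specification, relative to $HOME.
_XDG_DEFAULTS = {
    "XDG_CONFIG_HOME": ".config",
    "XDG_DATA_HOME": ".local/share",
    "XDG_STATE_HOME": ".local/state",
    "XDG_CACHE_HOME": ".cache",
}


def xdg_dir(var, env=None, home=None):
    """Resolve one XDG base directory (hypothetical helper).

    `env` and `home` can be passed explicitly to make the lookup testable;
    by default the real environment and home directory are used.
    """
    env = os.environ if env is None else env
    home = Path(home) if home is not None else Path.home()
    value = env.get(var, "")
    # Unset or empty falls back to the default; relative paths are ignored.
    if value and os.path.isabs(value):
        return Path(value)
    return home / _XDG_DEFAULTS[var]


if __name__ == "__main__":
    for name in _XDG_DEFAULTS:
        print(name, "->", xdg_dir(name))
```

With nothing set in the environment, the state directory resolves to ~/.local/state and the data directory to ~/.local/share, which is the distinction discussed in this thread.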

There have been discussions about this for shell histories. Fish declined the proposal [4]. The discussion in NuShell is still open [5]. Elvish already uses XDG_STATE_HOME to store history and other data. Bash and Zsh do not follow the XDG spec in the first place; instead, they let users set the HISTFILE variable in startup files like .bashrc and .zshrc.

akinomyoga · January 17, 2024, 1:22pm

Here, I reply with my personal opinion. I’m not really convinced by the discussion in Fish. Here are my thoughts on the arguments there.

The maintainer has never seen it being used, but I think that’s just because the specification itself is new. Also, XDG_STATE_HOME is indeed used by applications on my machine, so this seems to reflect only the Fish maintainer’s personal experience.
Another reason seems to be that the spec says XDG_STATE_HOME stores less important data than XDG_DATA_HOME, so the maintainer considers that data needing backup shouldn’t go into XDG_STATE_HOME. I feel this is a wording issue in the spec, given the original proposal for XDG_STATE_HOME.

The resources in /usr/share are supposedly required for the application to work properly. In this sense, they are more important than the history and logs from the application’s perspective. However, the resources are not subject to backup because they are part of the packages or libraries that are installed automatically.

Also, it is unclear whether the author of the spec considered a shell’s command history to be important. Recently, people have started to think it is important (probably because we now have ample storage to keep the full command history indefinitely), but traditionally the command history had an upper limit. For example, Bash only remembers 500 commands by default. I think the spec shouldn’t have used the term “important”, because importance is subjective and depends on the person.

On the other hand, the discussion in NuShell seems more reasonable. As discussed there, it is true that there is inertia to keep using XDG_DATA_HOME by convention, even though that is not its intended use. I feel it’s tidy to separate the resources from the user data, so I believe it’s worth discussing the possibility of shifting to XDG_STATE_HOME, but it’s a matter of preference after all.

ellie · January 17, 2024, 4:08pm

Mine too, noting that it defaults to ~/.local/state.

Quoting the spec here:

$XDG_DATA_HOME defines the base directory relative to which user-specific data files should be stored

The $XDG_STATE_HOME contains state data that should persist between (application) restarts, but that is not important or portable enough to the user that it should be stored in $XDG_DATA_HOME

While XDG_DATA_HOME may not originally have been intended for user data, convention has certainly led to that being the case. And the spec, while perhaps a little unclear, now also seems to suggest this. It’s perhaps a philosophical argument as to whether we should follow spec or convention, but I’d lean more towards convention.

I think it’s pretty safe to assume that if a user has Atuin installed, then they consider history to be important.

My thoughts about this for Atuin:

Changing would complicate support. Right now, we can comfortably tell users that “config is in ~/.config/atuin, data is in ~/.local/share/atuin”. While following XDG_STATE_HOME might follow a Linux spec a bit more closely, I’d rather we stuck to the principle of least surprise and put things where users expect them. If STATE becomes more well-known, I may be open to changing my mind there.
It’s very unclear what the spec actually means. It seems to imply that if XDG_STATE_HOME were to be wiped, applications would still function correctly. At present, most of the data we store is pretty critical. There are some files that could arguably be moved (e.g., last update check, sync time, etc.), but I’m not sure following the spec is more important than keeping things in one place.
Migrating defaults would be… frustrating. And would likely lead to breakages.

Otherwise, we do allow users to configure the location of various files. So if they feel strongly about this, they can change it.
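To make the migration concern concrete, here is a rough sketch of the kind of lookup a changed default would need so that existing installs keep working. This is not how Atuin is implemented; the function name `pick_history_path` and the file name `history.db` are placeholders chosen for illustration.

```python
from pathlib import Path


def pick_history_path(data_home, state_home, app="atuin", name="history.db"):
    """Choose where the history file lives (hypothetical helper).

    Prefer the new state location if a file is already there, fall back to
    the legacy data location if only that one exists, and use the new
    location for fresh installs.
    """
    legacy = Path(data_home) / app / name
    preferred = Path(state_home) / app / name
    if preferred.exists():
        return preferred
    if legacy.exists():
        # Existing install: keep reading the old file rather than moving it.
        return legacy
    return preferred
```

Even this small version means two possible answers to “where is my data?”, and that is before handling the case where both files exist with different contents.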

Totally happy to discuss further, perhaps I’ve missed some things.

akinomyoga · January 17, 2024, 9:19pm

Thank you for taking the time to respond. Your decision totally makes sense. When the migration cost is taken into account, keeping the current one for now (and seeing how it goes if needed) would be a valid option. I just wanted to share the information and add a choice that might be taken in the future.

lilydjwg · January 18, 2024, 3:06am

>>> tree -L 2 ~/.local/state
/home/lilydjwg/.local/state
├── gnuplot_history
├── lesshst
├── mpv
│   └── watch_later
└── wireplumber
    ├── default-nodes
    ├── default-profile
    ├── default-routes
    ├── policy-bluetooth
    └── restore-stream

I have the same feeling.

akinomyoga · January 18, 2024, 4:07am

Yeah, of course, if XDG_STATE_HOME were wiped out, it would affect the application’s behavior. When I wrote “the resources in /usr/share are required for the application to work properly”, I meant “work properly” in the sense that the application wouldn’t crash. The resources in /usr/share are part of the application, and if they were lost, the application might crash or behave unpredictably. I think the spec wanted to exclude that type of data when defining “the state data” in its description. At least, that seems to be the interpretation consistent with the original proposal of XDG_STATE_HOME.

Besides, the specification doesn’t say “XDG_STATE_HOME would be wiped out”. It’s kind of the opposite; it says the content of XDG_STATE_HOME should persist. The one that can be wiped out is XDG_CACHE_HOME but not XDG_STATE_HOME.

ellie · January 18, 2024, 5:31pm

Hmmm I see. In any case, thank you for raising this! Happy to discuss further

In my poking around, it looks like XDG_STATE_HOME is presently used by applications for the kind of data where I (as a user) wouldn’t really care if it were to be removed.

For example, the GitHub CLI uses it for storing the last time it checked for an update, plus the latest release information.

(we also do this, just storing it in XDG_DATA_HOME)

uep · January 23, 2024, 11:45pm

Broadly, I’m a fan of being specific about what kind of data belongs where, and in particular being able to use that for backup/replication policy settings.

Even just looking at the current specification as an ideal target state and ignoring past conventions and migration issues, my main observations are:

Nothing (at least, nothing quoted in the above) makes clear the distinction between state and cache. Some of the descriptions that are attempting to make the distinction between data and state would also fit cache.
If ‘data’ is only to be static, delivered-by-the-application-package files, then why do I need to back it up at all? Surely I just “back up” the list of applications I want, and reinstall them? So actually they’re much less important than state. Historical exceptions are the only reason I would need to, right?

The guidance needs to be better if we’re to hope for more consistent implementation from apps. Regardless, every application will make its own choices. Sometimes it’s just not worth making the distinction; for example, perhaps the GitHub timestamp example might go in ‘state’ rather than ‘data’, or ‘cache’ rather than ‘state’, but it’s just not worth having a whole extra directory tree for the one tiny file.

For Atuin specifically, I think Ellie has it spot on (of course). Yes, it’s clear that Atuin users consider shell history important, but does the classification (not importance) of that data change depending on whether it’s synced to a remote server? At that point it’s recoverable state, potentially even cache. Ambiguity in the spec aside, should we keep two databases, separating synced and unsynced shell history items, just to better follow the (assumed) intent? Surely not. (Also, my personal take is that the central server is the cache and point of interchange only, and is recoverable from any recently up-to-date client or clients.)

ellie · January 24, 2024, 2:52pm

I could not agree more

Also my thoughts on the matter! In theory only, I could wipe the whole database and all active users will re-upload their shell history pretty rapidly. Obviously not going to happen, but a fun thought exercise.

I’d love to see more details on the difference, and a clearer spec. But for now I don’t see our usage/implementation changing