FWIW: From a hardware/implementation perspective, the main feature of Prolog is that it performs exhaustive search and gives the programmer a way to specify such searches in a declarative (less buggy) way. Loops within loops do this too, but provide more opportunities for inserting bugs. Prolog does this by generalizing the search and using backtracking, a technique that was essentially frowned upon in the early days of computing due to hardware limitations, and that is practical once again now.
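To make the contrast concrete, here is a minimal Python sketch (not Prolog, and not how any real Prolog engine is built) of the same exhaustive search written two ways: once as a general backtracking search over choice points, and once as hand-written loops within loops. The names `solve`, `ordered_pythagorean`, `triples_backtracking` and `triples_nested` are made up for this illustration.

```python
from typing import Callable, Dict, Iterator, List, Optional, Sequence, Tuple

Assignment = Dict[str, int]


def solve(variables: Sequence[str],
          domain: Sequence[int],
          consistent: Callable[[Assignment], bool],
          partial: Optional[Assignment] = None) -> Iterator[Assignment]:
    """Depth-first exhaustive search with backtracking (illustrative only)."""
    partial = {} if partial is None else partial
    if len(partial) == len(variables):
        yield dict(partial)          # a complete solution; hand out a copy
        return
    var = variables[len(partial)]    # next unassigned variable
    for value in domain:             # choice point: try each candidate value
        partial[var] = value
        if consistent(partial):      # prune branches that already fail
            yield from solve(variables, domain, consistent, partial)
        del partial[var]             # backtrack: undo the choice


def ordered_pythagorean(partial: Assignment) -> bool:
    """The 'declarative' part: what a solution looks like, not how to find it."""
    values = list(partial.values())  # insertion order is a, b, c
    if any(x >= y for x, y in zip(values, values[1:])):
        return False                 # require a < b < c
    if len(values) == 3:
        a, b, c = values
        return a * a + b * b == c * c
    return True                      # incomplete assignments are still viable


def triples_backtracking(n: int) -> List[Tuple[int, int, int]]:
    domain = range(1, n + 1)
    return [(s["a"], s["b"], s["c"])
            for s in solve(("a", "b", "c"), domain, ordered_pythagorean)]


def triples_nested(n: int) -> List[Tuple[int, int, int]]:
    """The same search as explicit loops within loops."""
    found = []
    for a in range(1, n + 1):
        for b in range(a + 1, n + 1):
            for c in range(b + 1, n + 1):
                if a * a + b * b == c * c:
                    found.append((a, b, c))
    return found
```

Both functions visit candidates in the same order and return the same list. The difference is where the bookkeeping lives: in `solve` the trying and undoing is written once, whereas in `triples_nested` every loop bound is another place to slip up.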
MiniKanren also does exhaustive search, but doesn't use backtracking, trading off memory usage instead.

The canonical "assembler" for Prolog is the WAM (Warren Abstract Machine), which is used by GNU Prolog. IIUC, GNU Prolog implements Prolog in Prolog; it can be told to show the resulting WAM, which was useful to me when I was trying to write a WAM in Lisp. A write-up of WAM principles can be found in Aït-Kaci's tutorial, Warren's Abstract Machine: A Tutorial Reconstruction. Various Lisp-based implementations are documented in PAIP, On Lisp, and others.

IMO, the most understandable implementation of Prolog is Nils Holm's Prolog Control in 6 Slides (the t3x.org website is 404'ing on me at this moment). The Holm version is written in Scheme. I found it so understandable that I even managed to hand-port it to Common Lisp and to mechanically port it to JavaScript (the main thrust of that port was to explore OhmJS, not Prolog in particular, but it appears to work).

I think that one way to speed up a Prolog program is to remove all generalizations from a specific program, i.e. to take a given (working) program and pre-compile it into a bunch of nested loops written in assembler (and, for extra oomph, to remove all need for context-switching). The end product of any programming language is assembler code for use on a CPU. Some compilers get there by emitting only assembler; some emit assembler that leans on an engine. I think that Prolog fits into the 'engine' category. Many popular languages fall into the 'engine' category too, where the engine happens to be a lump of code that implements context-switching. Such engines are often called "operating systems", and they usually burn a lot of CPU cycles (something like 7,000-11,000 cycles per context switch, according to Claude 3.5).

The other way is to find ways to parallelize the program. LLMs operate on the principles of massive parallelization, but end up lying to you on occasion (i.e. LLMs, in their current state, can't be trusted and they ain't Engineering). Sequential programming techniques and languages essentially oppose massive parallelization, requiring one to think hard to achieve it.
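As a rough illustration of the "remove all generalizations" idea, the Python sketch below answers one query two ways: through a tiny general engine that unifies goals against facts and backtracks, and through the nested loops you might get by specializing that one query by hand. It is a toy under loud assumptions: it handles ground binary facts only (no rules, no compound terms), treats capitalized strings as variables, and `PARENT`, `run`, `unify` and `grandparents_specialized` are invented names, not part of any Prolog system or of Holm's code.

```python
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Env = Dict[str, str]
Goal = Tuple[str, str, str]          # (relation, first arg, second arg)

# Hypothetical fact base: parent(alice, bob). etc.
PARENT = [("alice", "bob"), ("bob", "carol"), ("bob", "dave"), ("eve", "alice")]
FACTS = {"parent": PARENT}


def is_var(term: str) -> bool:
    return term[:1].isupper()        # assumption: capitalized means variable


def walk(term: str, env: Env) -> str:
    while is_var(term) and term in env:
        term = env[term]             # follow bindings to their final value
    return term


def unify(a: str, b: str, env: Env) -> Optional[Env]:
    a, b = walk(a, env), walk(b, env)
    if a == b:
        return env
    if is_var(a):
        return {**env, a: b}         # new environment; old one stays intact
    if is_var(b):
        return {**env, b: a}
    return None                      # two different constants: failure


def run(goals: Sequence[Goal], env: Optional[Env] = None) -> Iterator[Env]:
    """General engine: solve goals left to right, backtracking over facts."""
    env = {} if env is None else env
    if not goals:
        yield env
        return
    (rel, x, y), rest = goals[0], goals[1:]
    for fx, fy in FACTS[rel]:        # each fact is a choice point
        e1 = unify(x, fx, env)
        e2 = unify(y, fy, e1) if e1 is not None else None
        if e2 is not None:
            yield from run(rest, e2)


def grandparents_general() -> List[Tuple[str, str]]:
    # grandparent(G, C) :- parent(G, P), parent(P, C).
    query = [("parent", "G", "P"), ("parent", "P", "C")]
    return [(walk("G", e), walk("C", e)) for e in run(query)]


def grandparents_specialized() -> List[Tuple[str, str]]:
    """The same query with the engine 'compiled away' into nested loops."""
    return [(g, c) for g, p in PARENT for p2, c in PARENT if p == p2]
```

Both functions return `[("alice", "carol"), ("alice", "dave"), ("eve", "bob")]`. The specialized version has no environments, no unification and no recursion left in it, which is the kind of saving I have in mind, at the cost of working for exactly one query.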