Alternative(s) to an SQL-based word-level inverted index

So I built a database of documents. As of this writing it contains 18,693 HTML documents, which  combined contain 1,453,866 paragraphs. Once the most common words have been excluded ('the', 'of', 'to', 'and', 'a', 'that', 'in', 'is', 'for', 'or', 'on'), there are 272,708 distinct words. So far so good. The issue I’m slamming up against now, though, is the word-level inverted index (wordIndex) I built so I could do phrase searching, find WordA within 5 words of WordB, find documents with WordC, WordD, and WordE all appearing in the same paragraph, etc. Kind of a poor man’s LexisNexis. I rolled my own for the mental exercise, but now I’m wondering if there’s a better way.

Profiling the PHP code that builds the index, the single most expensive call is writing to that wordIndex table. It takes an average of about 10 minutes to index each document; here’s a representative import session:

Words¶sSeconds Elapsed
3571118324.9
6791873.4
215277206.7
157267197.6
7652175730.1
7639104694.4
273553249.2
3301158355.5
472692466.3
10318252844.8
155075174.5
3535118357.6
2371124.5
4316168385.8
217268197.8
488599457.1
6876181635.1
5342170488.1

The reverse word index is 44,697,684 rows. The structure is:

+----------------+----------------------+------+-----+---------+-------+
| Field          | Type                 | Null | Key | Default | Extra |
+----------------+----------------------+------+-----+---------+-------+
| uuid           | char(36)             | NO   | PRI | NULL    |       |
| word_uuid      | char(36)             | YES  | MUL | NULL    |       |
| document_uuid  | char(36)             | YES  | MUL | NULL    |       |
| paragraph_no   | int(10) unsigned     | YES  | MUL | NULL    |       |
| word_no        | int(10) unsigned     | YES  | MUL | NULL    |       |
+----------------+----------------------+------+-----+---------+-------+

(I’m using uuids throughout because I ran into issues with autoincrement primary key integers and older MySQL / MariaDB replication. (I’m hosting the master database on a server I run at home, pushing to a copy on an inexpensive Linux VPS where I host an interface to the database for other lawyers to use.) (CentOS 6.10 at home - I put the server online about 8 years ago - currently running MariaDB 5.5.60 - and CentOS 7.6.1810 on the remote instance, same MariaDB version.)

Anyway. I’m starting to wonder if I’m using the right approach - or more specifically, where I should start educating myself, to put me on the path to understanding and deploying the right technology. My little hobby project has actually become a very usable tool - even got some beer money through the donation link 😋. So now I want to level it up. My skills, too.

Where do I start? My initial focuses (foci?):

  • Best way to implement the word-level inverted index or its functional equivalent. (Maybe in a way that’s “webscale”?)
  • Best way to do word grouping (stemming, thesauri lookup (where, e.g., dog, puppy, canine would all find the same word, if strict matching wasn’t specified in the query))


‡ A little about me: I’ve been dabbling in programming since my days of banging out AppleSoft BASIC on an 8-bit Apple //e, and took some programming courses in college, but I’m mostly a self-taught hacker with no real formal or rigorous training. I last banged out PHP and Perl code for a payday about a decade ago, and became an Oracle Certified Professional Java Programmer about six years ago mostly to see if I could...

Comments