Alternative(s) to an SQL-based word-level inverted index

So I built a database of documents. As of this writing it contains 18,693 HTML documents, which combined contain 1,453,866 paragraphs. Once the most common words have been excluded ('the', 'of', 'to', 'and', 'a', 'that', 'in', 'is', 'for', 'or', 'on'), there are 272,708 distinct words. So far so good. The issue I’m slamming up against now, though, is the word-level inverted index (wordIndex) I built so I could do phrase searching, find WordA within 5 words of WordB, find documents with WordC, WordD, and WordE all appearing in the same paragraph, etc. Kind of a poor man’s LexisNexis. I rolled my own for the mental exercise^‡, but now I’m wondering if there’s a better way.

Profiling the PHP code that builds the index, the single most expensive call is writing to that wordIndex table. It takes an average of about 10 minutes to index each document; here’s a representative import session:

Words	¶s	Seconds Elapsed
3571	118	324.9
679	18	73.4
2152	77	206.7
1572	67	197.6
7652	175	730.1
7639	104	694.4
2735	53	249.2
3301	158	355.5
4726	92	466.3
10318	252	844.8
1550	75	174.5
3535	118	357.6
237	11	24.5
4316	168	385.8
2172	68	197.8
4885	99	457.1
6876	181	635.1
5342	170	488.1

The reverse word index is 44,697,684 rows. The structure is:

+----------------+----------------------+------+-----+---------+-------+

| Field          | Type                 | Null | Key | Default | Extra |

+----------------+----------------------+------+-----+---------+-------+

| uuid           | char(36)             | NO   | PRI | NULL    |       |

| word_uuid      | char(36)             | YES  | MUL | NULL    |       |

| document_uuid  | char(36)             | YES  | MUL | NULL    |       |

| paragraph_no   | int(10) unsigned     | YES  | MUL | NULL    |       |

| word_no        | int(10) unsigned     | YES  | MUL | NULL    |       |

+----------------+----------------------+------+-----+---------+-------+

(I’m using uuids throughout because I ran into issues with autoincrement primary key integers and older MySQL / MariaDB replication. (I’m hosting the master database on a server I run at home, pushing to a copy on an inexpensive Linux VPS where I host an interface to the database for other lawyers to use.) (CentOS 6.10 at home - I put the server online about 8 years ago - currently running MariaDB 5.5.60 - and CentOS 7.6.1810 on the remote instance, same MariaDB version.)

Anyway. I’m starting to wonder if I’m using the right approach - or more specifically, where I should start educating myself, to put me on the path to understanding and deploying the right technology. My little hobby project has actually become a very usable tool - even got some beer money through the donation link 😋. So now I want to level it up. My skills, too.

Where do I start? My initial focuses (foci?):

Best way to implement the word-level inverted index or its functional equivalent. (Maybe in a way that’s “webscale”?)
Best way to do word grouping (stemming, thesauri lookup (where, e.g., dog, puppy, canine would all find the same word, if strict matching wasn’t specified in the query))

^‡A little about me: I’ve been dabbling in programming since my days of banging out AppleSoft BASIC on an 8-bit Apple //e, and took some programming courses in college, but I’m mostly a self-taught hacker with no real formal or rigorous training. I last banged out PHP and Perl code for a payday about a decade ago, and became an Oracle Certified Professional Java Programmer about six years ago mostly to see if I could...

Search This Blog

A Geek Goes Flying

Alternative(s) to an SQL-based word-level inverted index

Comments

Post a Comment