disjoin - in a way does the contrary of a database "join"
A script to do the contrary of a database "join" on two "comma separated values" (CSV) database text files.
The name of this tool is "disjoin" for reasons that I hope will become apparent in a moment.
The tool solves the problem of doing set operations (like intersection, difference, complement) on (plain text) database files.
A (CSV) text database file is a text file where each line corresponds to one record of the database.
Each record is divided into fields by some field separator character or string (not necessarily a comma).
One or more fields (NOT necessarily adjacent!!!) form the (unique) key of each record (like last name and given name for a person, for instance).
Now suppose you have two such database files and you want to know whether they share any key values (whether any people appear in both database files, for example).
And suppose you want to split your two database files into two parts EACH; one part with the records that have keys that do not appear in the other database file, and another part with the records that have key values that appear in BOTH database files.
(Meditate over the fact that even for keys appearing in BOTH files the data associated with them is not necessarily the same!)
By the way:
The tool allows you to specify a regular expression (in Perl syntax) for determining the field separator character(s).
See the online help (call "disjoin" without parameters or with a parameter "-h" or "-?") for more details on this (option "-F").
To define which fields form the key, use the "-L" option. Its argument is a comma separated list of field numbers (without spaces).
Note that counting starts at one, not zero. Field number zero always yields the empty string, so including it in your key field number list does no harm (but is of no use, either).
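The key extraction described above can be sketched in a few lines. This is only an illustration in Python, not the code of "disjoin" itself: the function name "extract_key" and the sample record are made up for the example, Python's regular expression syntax differs from Perl's in some details, and the handling of field numbers beyond the last field is an assumption (the text above only specifies field number zero).

```python
import re

def extract_key(line, separator_pattern, key_fields):
    """Return the key of one record as a tuple of field values.

    key_fields holds one-based field numbers, which need not be adjacent.
    Field number 0 yields the empty string, as described in the text.
    Assumption: numbers beyond the last field are treated the same way.
    """
    fields = re.split(separator_pattern, line.rstrip("\n"))
    key = []
    for number in key_fields:
        if 1 <= number <= len(fields):
            key.append(fields[number - 1])  # one-based to zero-based
        else:
            key.append("")
    return tuple(key)

record = "Doe;1970;John"                      # hypothetical sample record
print(extract_key(record, r";", [1, 3]))      # ('Doe', 'John')
print(extract_key(record, r";", [0, 2]))      # ('', '1970')
```

The key is returned as a tuple so that it can be stored in a set or used as a dictionary key when two files are compared.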
To better illustrate what the tool does, two diagrams:
     ______
    /      \  ______
   /        \/      \
  (         / )      \
  (   A    (   )      )
  (        ( C )      )
  (        (   )  B   )
   \        ( /       )
    \______/\        /
             \______/
The set ( A + C ) is the set of the keys contained in File_A, the set ( B + C ) is the set of the keys contained in File_B, and the set ( C ) is the set of the keys contained in both File_A and File_B.
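The three regions of the diagram correspond directly to ordinary set operations. A small sketch in Python, with made-up key values standing in for the keys of the two files:

```python
# Hypothetical key sets, standing in for the keys found in each file.
keys_file_a = {"Doe", "Roe"}          # this is ( A + C )
keys_file_b = {"Doe", "Poe"}          # this is ( B + C )

region_c = keys_file_a & keys_file_b  # keys in both files
region_a = keys_file_a - keys_file_b  # keys only in File_A
region_b = keys_file_b - keys_file_a  # keys only in File_B

print(region_a, region_c, region_b)   # {'Roe'} {'Doe'} {'Poe'}
```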
Data flow diagram:
                       File_A       File_B
                          |            |
                           \          /
                            \        /
                             \      /
                              |    |
        comparison --->   ====+====+====   <--- comparison          ( Pass 1 )
                          |   |    |   |
                          |  /      \  |
                          V /        \ V
 controls selector --->   /|          |\   <--- controls selector   ( Pass 2 )
                         / |          | \
                        /  |          |  \
                       /   |          |   \
                      /    |          |    \
                     /     |          |     \
                    |      |          |      |
                    V      V          V      V
             File_A.1   File_A.0  File_B.0   File_B.1
Mnemonics: On computers, the zero ("0") is usually represented by an "O" with a slash ("/") through it. This is also a symbol for "intersection" in set theory. The one ("1") symbolizes "uniqueness".
Note that the original two database files are NOT modified, just read!
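The two passes of the data flow diagram can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the actual implementation of "disjoin": the function names "key_of" and "disjoin", their parameters and the sample records are all hypothetical, and the records are held in lists in memory instead of being read from and written to files.

```python
import re

def key_of(line, separator_pattern, key_fields):
    """Key of one record: a tuple of the (one-based) key fields."""
    fields = re.split(separator_pattern, line.rstrip("\n"))
    return tuple(fields[n - 1] if 1 <= n <= len(fields) else ""
                 for n in key_fields)

def disjoin(lines_a, lines_b, separator_pattern=",", key_fields=(1,)):
    """Split both record lists; returns (A.1, A.0, B.0, B.1)."""
    # Pass 1: compare the key sets of both inputs.
    keys_a = {key_of(l, separator_pattern, key_fields) for l in lines_a}
    keys_b = {key_of(l, separator_pattern, key_fields) for l in lines_b}
    shared = keys_a & keys_b

    # Pass 2: the comparison result controls where each record goes.
    def select(lines):
        unique, common = [], []
        for line in lines:
            key = key_of(line, separator_pattern, key_fields)
            (common if key in shared else unique).append(line)
        return unique, common

    a1, a0 = select(lines_a)
    b1, b0 = select(lines_b)
    return a1, a0, b0, b1   # the input lists are only read, never changed

# Hypothetical sample data; the key is last name plus given name.
file_a = ["Doe,John,Berlin", "Roe,Jane,Paris"]
file_b = ["Doe,John,Hamburg", "Poe,Edgar,Boston"]

a1, a0, b0, b1 = disjoin(file_a, file_b, ",", (1, 2))
print(a1)  # ['Roe,Jane,Paris']
print(a0)  # ['Doe,John,Berlin']
print(b0)  # ['Doe,John,Hamburg']
print(b1)  # ['Poe,Edgar,Boston']
```

Note how the record for "Doe,John" ends up in both ".0" parts, each time with the data from its own file: the key is shared, the associated data is not.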
"Disjoin" is meant to be a universal tool, which - by repeated and recursive application - makes more complex set operations possible.
An analogy should illustrate this:
[Image: four "NAND" gates combined to form a "XOR" gate]
By combining four logical "NAND" gates as shown above, it is possible to create a totally different kind of gate, an "XOR" gate.
Moreover, "NAND" gates are logically "complete", which means that ANY kind of gate can be realized using only "NAND" gates. (For a proof of this theorem, see the relevant literature.)
In the same way, by repeatedly using "disjoin" on the results of previous runs of "disjoin", it should be possible to perform ANY kind of set operation on your data, no matter how complex the operation is! (There's no formal proof yet that "disjoin" is "complete" in terms of set operations - but maybe you can provide one?)
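A few such combinations can be tried out with a small sketch. As before, this is Python for illustration only and makes several assumptions: the function "disjoin" below is a simplified stand-in for the tool (records are kept in lists, the key is simply the first comma separated field), and the sample data is made up. It demonstrates some compositions; it is no proof of completeness.

```python
def first_field(line):
    """Simplifying assumption: the key is the first comma separated field."""
    return line.split(",")[0]

def disjoin(records_a, records_b, key=first_field):
    """Stand-in for one run of the tool; returns (A.1, A.0, B.0, B.1)."""
    shared = {key(r) for r in records_a} & {key(r) for r in records_b}
    a1 = [r for r in records_a if key(r) not in shared]
    a0 = [r for r in records_a if key(r) in shared]
    b0 = [r for r in records_b if key(r) in shared]
    b1 = [r for r in records_b if key(r) not in shared]
    return a1, a0, b0, b1

# Hypothetical sample files.
file_a = ["1,ann", "2,bob", "3,cy"]
file_b = ["2,bea", "3,cat", "4,dan"]
file_c = ["3,col", "4,dee", "5,eve"]

a1, a0, b0, b1 = disjoin(file_a, file_b)

difference = a1                  # keys in A but not in B
intersection = a0                # keys in both, with A's data
symmetric_difference = a1 + b1   # keys in exactly one of A and B
union = file_a + b1              # every key once, A's data preferred

# Repeated application: feed a result of the first run into a second run.
_, in_all_three, _, _ = disjoin(a0, file_c)

print(symmetric_difference)  # ['1,ann', '4,dan']
print(union)                 # ['1,ann', '2,bob', '3,cy', '4,dan']
print(in_all_three)          # ['3,cy']
```

Concatenating two parts corresponds to a set union, which is safe here because the parts being combined never share a key.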
Copyright © 1997 - 2016 by Steffen Beyer