Friday, November 24, 2006

I feel this should be viewed in the context of the Hutter prize and it is important to examine how this will affect AI. The program given simply compresses a series of integers, although thw program is eventually intended to provode for compression based on constructing a probability table.

The arrangement is in order of decreasing probability. Your first word in a string is simply the mean frequency of every word. The list for the second word is the probabilities BEARING IN MIND THAT THE FIRST WORD IS ******. I was assuming that the list was prepared using DCA. OK something else could be, my rationalizations like always feminine singular is the way I think. DCA if applied will bring out gender differences with eigenvctors containg la, el, los, las.
DCA of course talks the language of probability, it does not attempt to parse. It might implicity parse under certain circumstances, in all European languages, with the exception of English there is gender and agreement. DCA will pick this up and could be described as a partial parsing.

A diagram of what is happenning is given below.

Here is the guts of it There are a few algorithms in here which are not really relevant. You can compile it into whatever system you wish. There are some I/O routines which the user will have to write for himself. They are fairly trivial.

/* This program is designed to achieve the ultimate in compression. THere are codes like the Huffing code that lose very little
entropy. This provides for the ultimate. It is a well known mathematical fact that the number of ways of arranging objects in holes
is N!/(Pi(n(i)!(N-Sigman(i))!) This assumes that each site is occupied by 1 and 1 only object.
The problem then is to enumerate each individual configuration. If each configuration can be numbered compression has been achieved.
Consider N!/(n!(N-n)!) If we can evaluate that expression the complete expression can be evaluated. For we can work out the
configuration for n(0), We are now left with N-n(0) places. Then n(1), N-n(0) and so on. The algorithm used is the following:-
If we take the n[0]th poaition of number n[0] and place it somewhere we know that the number of arrangements of the remaining
n[0]-1 will be P!/((n[0]-1)!(P-1-n[0])!)
This gives us a scheme of numbering. If we calculate Sigma(P!/((n[0]-1)!(P-1-n[0])!)) as ball -> P position we have a numbering scheme.
class Factorial{
/* This class is the class of factorials. arr starts off with 1,2,3.....n It is created in this form by Factorial(int).
divide divides by an integer. It does this by finding the Highest common factor using Euclid's algorithm, and divides
repeatedly until the number you are divising by becomes equal to 1. If division is by a factorial the numbers are taken
in turn. In this way we get N!/(n!(N-n)!) in is simplest form.
int n;//The number of numbers in the factorial
int *arr;
Factorial(int);//sets up factorial. Initially the numbers are sequential
void divide(Factorial);//This divides by another factorial. It does this to limit the size of the numbers.
void divide(int);//This divides by a single integer
class Big{
/* This is the Big integer class which ios the guts of the program.
n is the number of I*4 wordsin the integer
*arr is the array of numbers
We need to be able to perform the basic arithmetrical operations and in addition needs to have a remainder when dividing by a big
integer, Basically we use these remainders are the numbers of N!/(n!(N-n)!). To make division easier the big integer we are
dividing by is presented as a series of I84 integers. We divide by each in turn.
int n;
int *arr;
void add(int);//Add I*4 integer
void add(Big);//Add another
void sub(Big);//Subtract big integer
void mul(int);//Multiply by I*4 integer
void mul(Big);//Muliply by big integer. In normal form, not in the special form for division.
int div(int);//Divide by I*4 inter
Big div(Big);//Big integer division
Big(int);//create a big integer of length n
Big();//create a big integer without arrays
Big(int*,int);//Create a big integer from the form for division
Big(Factorial);//Convert a factorial into a big integer
void wr(CFile*);//Write big integer onto file
void rd(CFile*);//read big integer from file
int bits();
boolean operator<(Big a);//These are simply comparison operators boolean operator>(Big a);
class Compress{
/* This class contains the complete compressed information - sufficient for loss free compression.
int n,max;
// n is the number of differnt numbers in the string.
// max is the highest number in the string, not necessarily the same as the above.
int *fact,*trans;
// fact contains the number of numbers in each non zero category
// trans is a table for translating each number into its value.
Big *b;//The big integer
void ecode(int*,int);//Encode
void rd(CFile*);//read and write files
void wr(CFile*);
int* dcode(int* n);//decode
class Entry{
CString s;
int q;
unsigned int ind;
class HashArray{
int size;
int last[3];
int places,nplaces;
int nvar;
int pos,posa,mvar;
int *len;
unsigned short **mat;
int **res;
void stats(int*);
int in;
int ksiz;
Entry *array;
int put(char*);
int get(char*);
boolean IsMatch(char*);
void Del(char *);
void rd(CFile*);
void wr(CFile*);
void sort(int*);
int init(int);
int combine();
void Vect(int*,int);
void del();
Big encode(int*,int,int);
void decode(Big,int*,int,int);
boolean IsLetter(unsigned char);
double Gosper(int);//This is a routine that gives the log of the factorial without actually evaluating it. It calculates entropy.
double kern(int);
void Diag(int*,int);
void Dev(double*,double*,int);
double *VecMat(int *vec);
int decomp(unsigned char*,int*,int);
int compr(int*,unsigned char*,int);

Factorial::Factorial(int m){//This constructs a factorial of size m
arr=new int[n];
for(int i=0;i>1;
return j;
return j;
Big::Big(Factorial f){//Create a big integer in divide form from a factorial
unsigned int *ar,q;
int mm=1+f.n/2;//Inconcievable resultant integer as big as this
ar=new unsigned int[mm];
for(int i=1;iWrite(&m,sizeof(int));
void Big::rd(CFile *f){//Read a big inter from file ff
arr=new int[n];
boolean Big::operator<(Big a){//These two functions allow us to use inequalities in the way they are used for ordinary unsigned // integers in C++ and Java for(int i=0;i(unsigned int)a.arr[i])return false;
return false;
boolean Big::operator>(Big a){
for(int i=0;i(unsigned int)a.arr[i])return true;
if((unsigned int)arr[i]<(unsigned int)a.arr[i])return false; } return false; } Compress::Compress(){//Creates an empty compress class n=0; } int *Compress::dcode(int *qq){ /* This subroutine decompresses a string. It does this by first looking for the positions of 1s then 2s etc. It decodes each individual string. *qq on exit contains the length of the string. */ int len=0; int *arr; int i,j,k,m; Big *mul;// Contains the big integers for multiplication for(i=0;imax)max=fact[i];//Get longest string
int *pos=new int[max];
for(i=n-1;i>0;i--){//Loop is in inverse order
Big d=b[0].div(mul[i-1]);//Divide. Remainder represents the configuration
int mm=fact[n-1-i];
decode(d,pos,mm,d.n);//Get string
for(j=0;j=mm)break;//are we at end
delete d.arr;
delete mul[i-1].arr;
delete pos;
pos=new int[n];
while(i==0){//pos contains the translation table
if(m<1)continue; i="1;" k="0;kRead(&max,sizeof(int));
fact=new int[n];
trans=new int[max+1];
void Compress::wr(CFile *f){//write class to file f
void Compress::ecode(int *str,int len){
/* This subroutine encodes a string of length len in *str. In order to facilitate decoding it takes the highest number first
int i,j,k,m,nbits=0;
Big *mul,*val;
b=new Big[1];
fact=new int[max];//Get numbers of each number in string
trans=new int[max+1];
for(i=0;ib2)){//Compare. When the number of positions is greater than the big integer we have just passed position
b2.sub(b3);//Subtract b3
b3.mul(j-1);//Calculate number of permutations for the rest
delete b3.arr;
Big encode(int *pos,int npos,int dim){
/* This one encodes. It is the analogue of the one above.
Big b2(dim);
Big b3(dim);
int ma=npos;
for(int i=0;ij)str[i]++;
for(int ii=0;iiif(bins[ii]==0)continue;
goto loop;
int euclid_algorithm(int i,int j){//This algorithm impliments euclids algorithm for integers i and j
int k=1;
return i;
bcopy (utility function)
/* This subroutine copies one string onto another.
len bytes are copied from x to y*/
void bcopy(void *x,void *y,int len){
int i=len/4;
int j,k;
mov esi,x
mov edi,y
mov ecx,i
rep movsd
mov j,edi
mov k,esi
int *xx=(int*)x;
int *yy=(int*)y;
mov edi,j
mov esi,k
xor eax,eax
mov ecx,len
and ecx,3
rep movsb


Anonymous Anonymous said...

You said:
The problem then is to enumerate each individual configuration. If each configuration can be numbered compression has been achieved.

How do you figure that? Even if true, that does not mean that they represent minimum loss-free compressions. There is something missing here. (I don't completely understand the formulas but I get the idea. I would like to start with an informal explanation if possible)
Jim Bromer

7:14 AM  
Blogger Ian Parker said...

It represents the maximum loss free compression provided that there is no relationship between the words. Further compression can be made iff such a relationship extis.

It is important to understand this iff thouroughly. You get about 20MB for the Hutter corpus if you assume no relationship between the words. The best compression so far is 17MB. This is achieved using fragment compression. Now there is a relationship between a word's meaning and its syllabic structure. Most students of a language readily appeciate this but the Hutter prize has proved this in quite a rigorous way.

It is also true to say that an AI technique based on word frequency must be capable of compressing to 17MB or less. This is because we can get from an integral representation to a syllabic representation be means of a matrix multiplication.

As M1*M2 = M3 this resolves itself down to a matrix. A product matrix (suppose M2 is the identity matrix)must compress at least as much.

5:09 AM  

Post a Comment

<< Home