Note to Shelf: Music, Language, Zipping

Some time ago, it was discovered that looking at the dictionary of the ZIP algorithm applied to literature could identify its source language. (I can't find the link, too many zipped binary packages about languages.)

I wonder if you can identify authorship using zip dictionaries.

I also wonder what you can identify by zip-dictionary-analysis of music.  Genre?  Performer? Composer?


Victor said...

I doubt it. It probably works for ZIP archives by means of letter frequency analysis -- each language has a default letter frequency signature, I imagine. I don't see how that could be expanded to the characteristics you mentioned.

Jeremy Rice said...

[nod] On later introspection, I realized the flaw--at least for music.

It's somewhat misleading to say language recognition would be done by letter frequency: it would be more accurate to say "cluster frequency", since the ZIP algorithm is essentially finding patterns of adjacent terms and reducing them to single dictionary entries.
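To make the "cluster frequency" idea concrete, here is a rough Python sketch. It is an illustration under assumptions, not the method from the article I couldn't find: DEFLATE (what ZIP and Python's zlib use) keeps a sliding window rather than a dictionary you can read out, so instead of inspecting a dictionary this measures how many extra bytes a sample costs when it is compressed right after a reference text. The function names and the toy reference sentences are made up for the example.

```python
import zlib

# Toy reference texts, invented for illustration; a real test needs far more text.
REFERENCES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the dog chases the fox through the woods",
    "french": "le renard brun rapide saute par dessus le chien paresseux "
              "et puis le chien poursuit le renard dans les bois",
}


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib (DEFLATE) output at maximum compression."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: bytes, sample: bytes) -> int:
    """Bytes the sample adds when it is compressed right after the reference.

    Clusters the sample shares with the reference become short back-references,
    so a sample in the same language should add fewer bytes.
    """
    return compressed_size(reference + sample) - compressed_size(reference)


def guess_language(sample: str, references: dict) -> str:
    """Return the name of the reference text the sample compresses best against."""
    encoded = sample.encode("utf-8")
    return min(
        references,
        key=lambda name: extra_bytes(references[name].encode("utf-8"), encoded),
    )


print(guess_language("the lazy dog chases the fox through the woods", REFERENCES))
```

Note that this never looks at single letters: whatever advantage one reference has comes from repeated runs of adjacent characters.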

Where this breaks down with sound is that the only information stored sequentially is the highest-frequency sounds, and these are the most variable and least characteristic part of the spectrum. ZIP dictionaries wouldn't tell you shit about the file.

If the method for storage of musical scores weren't so asinine*, I would imagine ZIPping scores would tell you something about the music, since certain clusters of musical information are characteristic of genre, if not composer.

...However, I'm not so sure about authorship. The use of closed-class items (shorter words) is one of the recognizable traits for a writer (as are spelling mistakes), and those are the most likely to be identified as recurring clusters... so I think there may be some merit to that line of thought.

Probably not much, though. : )
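If anyone wants to poke at the authorship idea, one hedged way to try it: zlib lets you prime the compressor with a preset dictionary (the zdict argument to zlib.compressobj), so you can seed it with a writer's known text and see how small an unknown sample gets. Everything else below is hypothetical: the function names, the two invented "authors", and their sentences. A real test would need far more text than this.

```python
import zlib

# Two invented writers with different habits in their small connecting words.
KNOWN_WRITING = {
    "author_a": "whilst it is upon the whole a matter of no great consequence, "
                "I should nevertheless wish it otherwise",
    "author_b": "so basically it's kind of a big deal, but honestly "
                "I don't really care that much about it anyway",
}


def size_with_dictionary(sample: bytes, dictionary: bytes) -> int:
    """Compressed size of the sample when zlib is primed with a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return len(compressor.compress(sample) + compressor.flush())


def closest_author(sample: str, known_writing: dict) -> str:
    """Return the writer whose known text shrinks the sample the most."""
    encoded = sample.encode("utf-8")
    return min(
        known_writing,
        key=lambda name: size_with_dictionary(
            encoded, known_writing[name].encode("utf-8")
        ),
    )


print(closest_author("I should nevertheless wish it upon the whole", KNOWN_WRITING))
```

The sample shrinks most against the writer whose recurring clusters it shares, which is about as close to "reading the ZIP dictionary" as zlib lets you get.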

* This, based on my knowledge of the MIDI file format only.