The Perils of File Typing

patrec · on Dec 6, 2020

> That said very few operating systems actually store MIME types, [...] MIME types are almost always derived from file extensions or magic numbers.

OpenDocument is generally a horrible format, but there is one clever aspect about it which I wish more file formats would adopt, but which I think is not widely known. Since it ought to be, here is a quick rundown: All OpenDocument subformats like ODT are zipfiles where the first entry is an uncompressed file with the name "mimetype" that has the mimetype (e.g. application/vnd.oasis.opendocument.text) as content. The way zip files are laid out, this means the string "mimetype$ACTUALMIMETYPE" ends up next to the pkzip magic numbers right at the beginning of the file and can be robustly detected by sniffing just a few bytes at the start of the file.

http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-...

Sadly, few other zip based file formats seem to do that, in particular docx and family do not. If you ever find yourself in the situation that you design a zip based file format, please adopt it.
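
A minimal sketch of that sniffing step in Python, assuming the first local file header follows the usual ZIP layout (30 fixed bytes, then the entry name, then the extra field, then the data). The function name `sniff_zip_mimetype` is made up for illustration, and it deliberately accepts only an uncompressed first entry whose size is recorded in the header:

```python
import struct
from typing import Optional

# Fixed part of a ZIP local file header: signature, version, flags,
# compression method, mod time, mod date, CRC-32, compressed size,
# uncompressed size, name length, extra field length (30 bytes total).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def sniff_zip_mimetype(head: bytes) -> Optional[str]:
    """Return the embedded MIME type if `head` starts with a stored
    'mimetype' entry, otherwise None. `head` is the start of the file."""
    if len(head) < LOCAL_HEADER.size:
        return None
    (sig, _version, flags, method, _mtime, _mdate, _crc,
     csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(head)
    if sig != b"PK\x03\x04":
        return None
    # Method 0 means "stored" (uncompressed). Flag bit 3 means the sizes
    # live in a trailing data descriptor, so the header sizes are unusable.
    if method != 0 or flags & 0x08:
        return None
    name_start = LOCAL_HEADER.size
    if head[name_start:name_start + name_len] != b"mimetype":
        return None
    data_start = name_start + name_len + extra_len
    data = head[data_start:data_start + csize]
    if len(data) != csize:
        return None  # caller did not read enough bytes
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None
```

Reading the first hundred bytes or so of a file and passing them in should be enough for MIME types of typical length.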

greggman3 · on Dec 6, 2020

Assuming that the data at the start of a zip file is valid is an invalid way to interpret a zip file. Zipfiles start at the end ALWAYS. The format was designed to be able to just jump to the end of the file, read the table of contents, append new data, write a new table of contents. That new table of contents might not even reference the data at the front of the file.

It was designed this way because it comes from the era of floppy disks. If you had a zip file spanning multiple discs and you wanted to update `firstfile.txt` at the front of the zip file you didn't want to have to insert all 10 disks and read and re-write all the data. You just wanted to insert the last disk, read the table of contents, append your new `firstfile.txt` and write a new table of contents. The invalid `firstfile.txt` still exists at the front but it's ignored by the table of contents.

If you're going to do stupid tricks by scanning from the front, then you'd be better off choosing a format for which that's actually valid, like .tar.gz or something.
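
For illustration, here is a rough Python sketch of the "start at the end" reading order described above: locate the end-of-central-directory record and read from it where the table of contents lives. The helper names are hypothetical, and the sketch assumes a single-disk archive without ZIP64 extensions and without an archive comment that happens to contain the record signature:

```python
import io
import struct
import zipfile

# End-of-central-directory record: signature, disk number, disk holding the
# directory, entries on this disk, total entries, directory size,
# directory offset, comment length (22 bytes total).
EOCD = struct.Struct("<4sHHHHIIH")


def find_central_directory(data: bytes):
    """Return (entry_count, directory_offset), or None if no record is found."""
    pos = data.rfind(b"PK\x05\x06")  # search backwards from the end
    if pos < 0 or len(data) - pos < EOCD.size:
        return None
    (_sig, _disk, _dir_disk, _entries_here, total,
     _dir_size, dir_offset, _comment_len) = EOCD.unpack_from(data, pos)
    return total, dir_offset


def build_sample_zip() -> bytes:
    """Build a small two-entry archive in memory to experiment with."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("firstfile.txt", "old contents")
        zf.writestr("other.txt", "more data")
    return buf.getvalue()
```

A generic reader that follows only the directory found this way never has to look at whatever bytes sit at the front of the file.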

JeremyBanks · on Dec 7, 2020

It's valid if you're not reading a generic zip file but instead a zip derived format like this.

alisonkisk · on Dec 6, 2020

Since editing a document doesn't change its type, this objection doesn't apply to zip-based filetypes.

epitactic · on Dec 6, 2020

> zipfiles where the first entry is an uncompressed file with the name "mimetype" that has the mimetype

The EPUB format also adopted this convention: https://www.w3.org/publishing/epub3/epub-spec.html#sec-intro...

"The EPUB Publication's resources are bundled for distribution in a ZIP-based archive with the file extension .epub. As conformant ZIP archives, EPUB Publications can be unzipped by many software programs, simplifying both their production and consumption.

The container format not only provides a means of determining that the zipped content represents an EPUB Publication (the mimetype file), "

ancarda · on Dec 6, 2020

Seems the dollar is missing - I only see "mimetypeapplication/vnd.oasis.opendocument.text". Nevertheless, I really like this approach. I will remember it in future.

    $ hexdump -C CV.odt | head -n 5
    00000000  50 4b 03 04 14 00 00 08  00 00 9c ad 7e 51 5e c6  |PK..........~Q^.|
    00000010  32 0c 27 00 00 00 27 00  00 00 08 00 00 00 6d 69  |2.'...'.......mi|
    00000020  6d 65 74 79 70 65 61 70  70 6c 69 63 61 74 69 6f  |metypeapplicatio|
    00000030  6e 2f 76 6e 64 2e 6f 61  73 69 73 2e 6f 70 65 6e  |n/vnd.oasis.open|
    00000040  64 6f 63 75 6d 65 6e 74  2e 74 65 78 74 50 4b 03  |document.textPK.|

layer8 · on Dec 6, 2020

The GP intended $ACTUALMIMETYPE to be read as a variable — there is no $ after replacement.

ancarda · on Dec 6, 2020

Oh silly me, of course. I thought perhaps $ denoted the end of a filename and the start of the contents in Zip. I don't know much about that format.

jez · on Dec 6, 2020

Another little known file type feature is that if you’re using a sqlite database file as an application file format, there are a handful of bytes reserved for applications to register that “this is a SQLite file for use specifically with application X”

Application ID docs:

https://www.sqlite.org/draft/fileformat2.html#application_id
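
A small sketch of how this could be used from Python's built-in `sqlite3` module. The ID value and the function names are made up for the example; the only things taken from the file format docs are the 16-byte magic string and the 4-byte big-endian application ID at offset 68 of the header:

```python
import sqlite3
import struct

# Hypothetical ID for an imaginary application (ASCII "MYAP").
EXAMPLE_APP_ID = 0x4D594150


def stamp_application_id(path, app_id):
    """Create (or open) a database and record the application ID in it."""
    con = sqlite3.connect(path)
    try:
        con.execute(f"PRAGMA application_id = {int(app_id)}")
        # Create a table too, so the file is certainly written to disk.
        con.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")
        con.commit()
    finally:
        con.close()


def read_application_id(path):
    """Sniff the application ID from the raw header without opening SQLite.
    Returns None if the file does not look like a SQLite 3 database."""
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 100 or header[:16] != b"SQLite format 3\x00":
        return None
    # The application ID is a signed 32-bit big-endian integer at offset 68.
    return struct.unpack(">i", header[68:72])[0]
```

A file manager or `file`-style tool could then tell "a SQLite file" apart from "a document of application X" by reading only the first 100 bytes.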

yencabulator · on Dec 7, 2020

Using SQLite as a transport format is not a good idea: https://research.checkpoint.com/select-code_execution-from-u...

PeterisP · on Dec 6, 2020

Given your description, it seems that any software writing a docx file could do just that without breaking compatibility.

alexvoda · on Dec 6, 2020

I believe Uniform Type Identifiers and Bundles are a much nicer solution. A real shame only Mac OS X supports them. I would love to see them implemented in other OSs.

patrec · on Dec 7, 2020

UTIs are nicer than mime types in that they allow a simple way to make up a unique name you control (java style com.myurl.myformat) and in that there is an inheritance mechanism (you can say com.myurl.myformat also is a sqlite database, for example).

But unless I misunderstand something, the fundamental problem of figuring out what particular logical type some collection of bytes on your disk has is in fact not solved at all, because the UTI is not stored together with the file in the filesystem. As far as I can tell macOS, apart from a few places like the clipboard, basically just takes a guess based on registered file extensions (or, for REST calls, the MIME type) -- is that not so?

kergonath · on Dec 6, 2020

That is actually brilliant. I had no idea, thanks for the information.

btschaegg · on Dec 6, 2020

TIL. Thanks for sharing that!

Someone · on Dec 6, 2020

“Instead of file extensions, the Macintosh used type and creator codes. These were 4-letter identifiers, much like file extensions, that identified both the type of file and the application used to create it.”

Technically, they were 4 bytes. Convention was to use bytes that happened to be 4 characters. Also, “that identified both” could be written more clearly. There were 32 bits to identify the type and 32 to identify the creator. The Finder used the creator to determine the application to launch; file open dialogs used the type to filter files.

“When ran for the first time, programs would tell the OS what its creator code was and which file types it supported.”

It was better: when the user copied an application to a disk or moved it around, the Finder read the information from the application’s resource fork and updated the database.

“The OS would then save this information on the boot disk.”

Not (necessarily) on the boot disk; on the disk containing the application.

InvisibleUp · on Dec 6, 2020

Thank you for your corrections. I've updated the article.

lmilcin · on Dec 6, 2020

So here is a lesson I take from these kinds of problems.

Look at the interface of your program. Are there any elements that seem to be an arbitrary decision on your part as a designer, there only so that you have an easier time?

If yes, even though it might seem a good idea today, there is a chance you will need to change it in the future or it will become a sore for some of your users.

Always try to understand the problem in terms of underlying fundamental rules/definitions and only use these for the outside surface (contract, API) of your system. Fundamentals don't change. If they change it most likely means you had wrong understanding of the underlying problem in the first place.

Using only fundamentals for your API will mean easier integration with clients and interoperability with other systems.

As an example, suppose you implement an application that assumes a user account is connected with an employee, that the employee has only one manager, and that only the manager does approvals. The application works, but you will find yourself in a lot of problems pretty soon when it turns out that there are times or parts of your organization that don't follow this pattern:

- an outside contractor might be brought that needs an account,

- an employee or contractor can have multiple managers,

- an employee or contractor may need multiple separate accounts,

- the manager may need to deputize somebody else to do approvals,

- the manager may want to decide they want to run their organization differently and trust their employees to do the right thing (but can't because the app forces approval flow),

etc.

These problems happened because of a naive understanding of the model of the system (i.e. the model does not correspond to the fundamental properties of the system) or because the designer decided it is too much work to implement these.

hexxiiiz · on Dec 6, 2020

Went down this road recently. I wrote some Python code to handle some big archives of music, PDFs, and other media, normalizing the names of everything. I decided to try to correct the extensions to indicate the file type properly. This turned out to be a little complicated. In most cases the libfile estimate was good (using the magic number) and provides a mimetype as an output. However, it sometimes overgeneralized the file type, or outright flattened it out to "binary data". To make this more robust, I used Python's mimetypes library to infer the mimetype from the filename as a secondary source of information. I then needed to use a set of heuristics to reconcile the two mimetypes: the one derived from libfile and the one from the file name. This works pretty well to identify consistent cases, getting them out of the way. When the libfile mimetype is precise it is usually safe to fix the extension, particularly if the difference is just audio or image format. Nonetheless, there are a lot of corner cases. If I were even more ambitious, the tough cases could probably be drilled down on further with some media metadata utilities. I am curious if someone has just worked this out already in the form of a library.
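
To make the reconciliation step concrete, here is a rough sketch of what such heuristics might look like. This is not the code described above: the function name, the set of "generic" types, and the exact rules are assumptions chosen for illustration. The content-derived type is passed in as a string, so any sniffer can supply it:

```python
import mimetypes

# Content-derived answers treated as "too vague to trust" (an assumption).
GENERIC_TYPES = {"application/octet-stream", "text/plain"}


def reconcile(sniffed, filename):
    """Combine a content-sniffed MIME type with the one implied by the name.

    Returns (mime_type_or_None, reason), where reason is one of
    'agree', 'content', 'name', 'conflict', 'unknown'.
    """
    by_name, _encoding = mimetypes.guess_type(filename)
    sniff_is_vague = sniffed is None or sniffed in GENERIC_TYPES

    if sniff_is_vague and by_name is None:
        return None, "unknown"
    if sniff_is_vague:
        return by_name, "name"       # fall back to the extension
    if by_name == sniffed:
        return sniffed, "agree"      # consistent case, nothing to fix
    if by_name is None:
        return sniffed, "content"    # no usable extension, trust the bytes

    major_sniffed = sniffed.split("/")[0]
    major_named = by_name.split("/")[0]
    if major_sniffed == major_named and major_sniffed in ("audio", "image"):
        # Same family, different format: the extension is probably wrong.
        return sniffed, "content"
    return None, "conflict"          # leave for manual review
```

Anything that comes back as 'conflict' or 'unknown' would be the pile of corner cases left for closer inspection.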

alexvoda · on Dec 6, 2020

This article only mentions in passing the best approach so far: Uniform Type Identifiers (1). When it comes to file types and associations Mac OS X hit a home run. This mechanism is way more powerful than MIME types because:

- UTIs use a reverse-DNS naming structure,

- UTIs support multiple inheritance (which is a good thing in this context),

- UTIs allow per-file exceptions.

It really is a shame this was not implemented by any other OS than Mac OS X. It is one of the brilliant technical features it has. Together with bundles (2), they make for very powerful and flexible filing.

(1) https://en.wikipedia.org/wiki/Uniform_Type_Identifier

(2) https://en.wikipedia.org/wiki/Bundle_(macOS)

dkmar · on Dec 7, 2020

Dynamic UTIs in macOS started to cause problems for me when I upgraded to Catalina.

Setting defaults for 'Open with' on these file types seems pointless as I'm forced to do it once per each dynamic UTI (as far as I can tell).

It's also annoying that quicklook (extensions) started playing dumb with them — you'd think that QL would look for the lowest familiar ancestor or something.

I'm not sure whether the file typing approach changed with catalina, or if the migration from qlgenerators to quicklook extensions broke things, or if it's just catalina being buggy but it's disappointing that the typing situation seemed better back in mojave.

ksherlock · on Dec 6, 2020

BeOS used mime types.

Practical File System Design with the Be File System - http://www.nobius.org/dbg/practical-file-system-design.pdf

ecpottinger · on Dec 6, 2020

BeOS and Haiku not only used MIME types but they also use the magic number approach to add a MIME type to a file if they only got the raw file.

Also unless I read the docs wrong SmallTalk added MIME types for creator/editor/viewer to a file. So you could find what created a file but could tell the OS to use a different program to edit the file and even still another program to just view the contents.

Please correct me if I am wrong.

kevin_thibedeau · on Dec 6, 2020

Win95 has MIME too in a bolted on sort of way. I think the only application is to associate them with a file extension so that email attachments via MAPI will get the right MIME type.

alisonkisk · on Dec 6, 2020

Not sure about Win95, but I thought Windows in general used a mimetype database for associating file extensions to types and types to programs.

AtlasBarfed · on Dec 7, 2020

Really there are two:

You have metadata in the file (magic numbers, etc)

You have metadata outside the file that is attached

Meta in the file will move between OSs.

Meta outside the file might not.

What we probably need is every file in all OSs to have an agreed upon metadata file header. I mean, it's 2020... but that is some serious cat herding.

Back when I was saddled with Documentum and the great problem of extracting metadata from office docs, this also reared its head. A common metadata block would have been nice so you didn't have to code against hostile file formats (not that Office would play nice with a common metadata block, but maybe customers could force them. Maybe. Well, I can dream)

zmix · on Dec 6, 2020

On AmigaOS filetype recognition was very sophisticated. There were several ways to do it:

1. File managers allowed you to configure a full combination of tests, which included magic bytes, a binary or string seek to a location with a test for a value, and the filename suffix, each as often as you liked, combined with boolean operators.

2. The system's datatypes.library had little slave drivers that would "implement" a filetype, including any code a developer could write to do complex recognition. Modern file managers allowed you to combine this with (1).

3. There was a shared function library (freeware), that did nothing but filetypes recognition, configurable by the user.

How I miss these days...
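
As a loose illustration of approach (1), here is how such combinable tests could be modelled in Python. Nothing here reflects an actual AmigaOS API; every name and the rule table are invented for the sketch:

```python
# Each test is a function (filename, head_bytes) -> bool.

def magic_at(offset, expected):
    """Test for specific bytes at a given offset in the file."""
    return lambda name, head: head[offset:offset + len(expected)] == expected


def has_suffix(suffix):
    """Test the filename suffix, ignoring case."""
    return lambda name, head: name.lower().endswith(suffix.lower())


def all_of(*tests):
    return lambda name, head: all(t(name, head) for t in tests)


def any_of(*tests):
    return lambda name, head: any(t(name, head) for t in tests)


def negate(test):
    return lambda name, head: not test(name, head)


# Example rule table, checked in order (contents are illustrative only).
RULES = [
    ("IFF ILBM image", all_of(magic_at(0, b"FORM"), magic_at(8, b"ILBM"))),
    ("PNG image", magic_at(0, b"\x89PNG\r\n\x1a\n")),
    ("JPEG image", any_of(magic_at(0, b"\xff\xd8\xff"), has_suffix(".jpg"))),
    ("Plain ZIP archive", all_of(magic_at(0, b"PK\x03\x04"),
                                 negate(has_suffix(".odt")))),
]


def classify(name, head, rules=RULES):
    """Return the label of the first matching rule, or None."""
    for label, test in rules:
        if test(name, head):
            return label
    return None
```

Because the tests are plain functions, a user-editable configuration would only need to describe which ones to combine and in what order.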

transfire · on Dec 6, 2020

Type and Creator Codes are the way. The given cons are implementation details, not immutable problems.

greggman3 · on Dec 6, 2020

No, they aren't, and there are good reasons OS X ditched them after Mac OS 9.

The biggest problem was passing files across the net via other systems (Unix, Windows). Pass a .JPG: Windows and Linux just want the .JPG, but the Mac needed this extra info or it didn't understand the file, making all files from Macs incompatible with the rest of the world and vice versa.

It was a nightmare

Mikhail_Edoshin · on Dec 7, 2020

So the downside is that it's not fully compatible with inferior systems? :)

jasperry · on Dec 6, 2020

It's well known that Windows' default behavior of hiding file extensions is bad for security (lookatme.jpg.exe). But I suspect the real reason Microsoft has kept this default over the years is that it prevents users from accidentally breaking the file association when they rename the file.

All of which just further supports the author's assertion that filename extensions are the wrong way to do file types/associations.

gruez · on Dec 6, 2020

>But I suspect the real reason Microsoft has kept this default over the years is that it prevents users from accidentally breaking the file association when they rename the file.

I doubt it. There are already enough preventative measures if you have "view file extensions" enabled. If you try to rename a file, only the part before the extension gets highlighted, and if you try to change the extension you get a scary warning.

PeterisP · on Dec 6, 2020

As far as I remember, those preventive measures were not in place back when hiding file extensions was introduced, which was either in 2001 with Windows XP or perhaps even earlier.

lcall · on Dec 8, 2020

Maybe irrelevant but FWIW, OpenBSD rewrote the "file" command from scratch to be more secure. I don't know how far that goes (like in the article "magic numbers" section discussing polyglot files, for example), but they have had success making other things more secure (ssh, cvs, smtp, ntp, ...). Maybe they just made "file" less likely to have bugs discovered in the future.

coldtea · on Dec 6, 2020

>why we read files using rewind, fseek, fread, and fwrite as if we were on a tape drive still

Not sure about rewind (is it the same as fseek(0)?), but regarding fseek, fread, and fwrite, they seem compatible with the abstract idea of "file as a series of bytes" and not particularly tied to the "stored on a tape drive" part.

What would be a modern era API for file reading, moving inside, and writing, that doesn't carry the "tape drive" heritage as implied by the author?

lapinot · on Dec 6, 2020

I would guess mmap: view it as a char[]. With flash storage and a ubiquitous page cache, the read/write/seek reading-needle abstraction is kind of moot. Triggering a syscall on seek when you could just pass an offset to every read is a bit disturbing. It's probably because of this that most languages provide buffered readers even though the underlying file is already buffered by the kernel.
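
For comparison, a short Python sketch of the two seek-free styles mentioned here: slicing a memory-mapped file like a byte array, and a positional read that takes the offset as an argument. The function names are made up, `os.pread` is only available on Unix-like systems, and an empty file cannot be mapped this way:

```python
import mmap
import os


def read_at_with_mmap(path, offset, length):
    """Map the whole file read-only and slice it like a byte array."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]


def read_at_with_pread(path, offset, length):
    """Read at an explicit offset; no file position is moved (Unix only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

Neither function keeps a "current position", which is the part of the classic API that resembles a tape head.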

alisonkisk · on Dec 6, 2020

Right. The opposite of random access is sequential access.

Tapes are extremely sequential, HDDs are less sequential, SSDs are barely sequential, and memory-mapped files are almost entirely non-sequential.

eqvinox · on Dec 7, 2020

related: https://www.freedesktop.org/wiki/CommonExtendedAttributes/ (how to store MIME type in file system extended attributes)

amelius · on Dec 6, 2020

They forgot about another option: put code in the metadata which is executed when the file is opened, previewed, etc.