This is the manual for older MicroStream versions (Version < 5.0).
The new documentation (Version >= 5.0) is located at:
Here is an overview of how to enable and configure different levels of user interaction for the Legacy Type Mapping.
Somewhere in your setup you have a foundation instance, the place where everything is configured and from which the StorageManager
is created.
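For example (a sketch; EmbeddedStorage.Foundation() is the usual way to obtain such a foundation, but the exact factory call may differ in your setup):

final EmbeddedStorageFoundation<?> foundation = EmbeddedStorage.Foundation();
// ... configure everything on the foundation, then start it:
// final EmbeddedStorageManager storage = foundation.start();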
The foundation itself contains an inner foundation for connections. Accessing that inner foundation needs a little detour. Incidentally, "connection" does not mean a JDBC connection here; it is simply the component that creates helper instances like Storer and Loader. Because Legacy Type Mapping affects loading, its configuration has to go in there.
Either you access it directly, like this:
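(A sketch; the accessor name getConnectionFoundation() is taken from the MicroStream foundation API.)

final EmbeddedStorageConnectionFoundation<?> connectionFoundation =
    foundation.getConnectionFoundation();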
Or like this, which is better suited for method chaining:
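(A sketch; onConnectionFoundation(...) accepts a lambda that receives the inner connection foundation.)

foundation.onConnectionFoundation(connectionFoundation ->
{
    // configure the Legacy Type Mapping here, see below
});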
If you have that, configuring the Legacy Type Mapping callback logic is just a one-liner:
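(A sketch, assuming the static New() factories MicroStream provides for these interfaces; the exact package may vary by version.)

foundation.onConnectionFoundation(f ->
    f.setLegacyTypeMappingResultor(
        PersistenceLegacyTypeMappingResultor.New() // the plain default mapping logic
    )
);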
That's just the necessary logic, without anything further. If you do not change anything, this is done by default.
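Alternatively, the default logic can be wrapped in a printing resultor, for example like this (a sketch; the wrapper class name PrintingLegacyTypeMappingResultor and its New(...) factory are assumptions based on the MicroStream API):

foundation.onConnectionFoundation(f ->
    f.setLegacyTypeMappingResultor(
        PrintingLegacyTypeMappingResultor.New(         // prints the mapping result
            PersistenceLegacyTypeMappingResultor.New() // around the default logic
        )
    )
);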
That wraps a printer around the necessary logic. None of these storage and persistence classes are sacred or tightly intertwined; they are just interfaces, and if you plug in another implementation, it will be used.
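Or you plug in the inquiring variant (a sketch, with the same assumptions about the New(...) factories):

foundation.onConnectionFoundation(f ->
    f.setLegacyTypeMappingResultor(
        InquiringLegacyTypeMappingResultor.New(        // asks the user on the console
            PersistenceLegacyTypeMappingResultor.New() // around the default logic
        )
    )
);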
This wraps an inquiring resultor around the default logic, which asks the user to confirm the result. More customization is possible, see below.
By implementing just one single interface method, you can build anything else you can imagine: logging to a file instead of the console, using your preferred logging framework, or writing confirmed mappings into the refactorings file. Everything is possible.
For the inquiring implementation (InquiringLegacyTypeMappingResultor) there are a few settings: when should it ask? Always, or only if something is unclear. "Never" makes no sense, of course; in that case you should not use it at all, or use the printing resultor instead.
When is a mapping unclear? If at least one field mapping is not completely clear. A field mapping is clear if:
The two fields are exactly the same (similarity 1.0 or 100%)
Or the two fields are matched by the explicit mapping.
So if all fields are clear according to the above rule, then there is no need to ask.
And there is another special case: if a field is discarded that is not explicitly marked as discardable, then as a precaution an inquiry is always made. Although no data is lost, the data would no longer be available in the application, so it is better to ask.
There are options to control this more finely. You can optionally specify a double value as a threshold (from 0.0 to 1.0, otherwise an exception is thrown): the value determines how similar two matched fields have to be in order not to be inquired about. Example: the threshold is 0.9 (90%), but a found match is only 0.8 (80%) similar. That is too little according to the specification, so an inquiry is made as a precaution. Specifying 1.0 means: always ask unless everything is really perfectly clear. Specifying 0.0 means: never ask, except for implicitly discarded fields.
Looks like this:
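(A sketch, assuming an overload of InquiringLegacyTypeMappingResultor.New(...) that takes the delegate resultor and the threshold value.)

foundation.onConnectionFoundation(f ->
    f.setLegacyTypeMappingResultor(
        InquiringLegacyTypeMappingResultor.New(
            PersistenceLegacyTypeMappingResultor.New(),
            0.9 // inquire about every automatic match that is less than 90% similar
        )
    )
);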
Here is a small example with a Person class that gets changed to a new version.
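The two class versions are not reproduced here verbatim; the following is a sketch reconstructed only from the field names discussed below (customerid / pin, comment, commerceId, surname / lastName, firstname):

// old version of the entity class
public class Person
{
    String customerid; // becomes "pin" in the new version
    String firstname;
    String surname;    // becomes "lastName" in the new version
    String comment;    // not present in the new version
}

// new version of the entity class
public class Person
{
    String pin;
    String firstname;
    String lastName;
    String commerceId; // newly added field
}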
Without explicitly predefined mappings, the inquiry would look like this:
customerid and pin are too different to be automatically assigned to each other. Therefore, it is wrongly assumed that customerid is omitted and pin is new. comment and commerceId are surprisingly similar (75%) and are therefore assigned to each other. But that is not what we want.
Incidentally, it would not matter here what is defined as the threshold: customerid would be eliminated by an implicit decision, which is too delicate not to inquire about, so the user is always asked.
To get the mapping right, you have to specify two entries: customerid is now called pin, and comment should be omitted (see the sketch below).
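(A sketch; the class name com.my.app.entities.Person is a placeholder, and the field syntax follows the rules described in the refactoring chapter below.)

com.my.app.entities.Person#customerid;com.my.app.entities.Person#pin
com.my.app.entities.Person#comment;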
Then the inquiry looks like this:
Due to the explicit mapping from customerid to pin, the similarity does not matter; only the mapping matters. To indicate this, it says "[mapped]" instead of the similarity. The rest is as usual. Only comment is now "[discarded]", according to the mapping. The difference to the case above is: this is an explicitly specified omission, which does not force an inquiry.
This clears the way for the threshold:
If you enter 0.7 or more, you will be asked: everything else would be clear, but the mapping of surname to lastName is below the required "minimum similarity", so it is better to ask.
If you enter 0.6 or less, you will no longer be asked, because all assignments are either explicitly specified or similar enough according to the "minimum similarity" to rely on them.
A recommendation for a good "minimum similarity" value is difficult. As soon as rules are softened, there is always the danger of a mistake. See the comment example above: it is 75% similar to commerceId and still wrong. So rather 80%? Or 90%? Higher is better, of course, but the danger remains.
If you want to be sure, just use 1.0 or omit the parameter; then 1.0 is taken by default.
The most important mechanism is the explicit mapping anyway: if "enough" of it is given by the user, there is no need to ask.
Refactoring V2
If one or more fields in a class have changed, the data structure of this class no longer matches the records in the database. This renders the application and the database incompatible.
It's like in an IDE: you change the structure of a class and the tooling takes care of the rest. The problem is that in a database, the "rest" can be several gigabytes or more that would have to be refactored and written again. That is one way to do it, but there are better alternatives.
Ideally, the data is transformed only when it is accessed. The old (legacy) type data is mapped to the new type when it is loaded, hence: Legacy Type Mapping.
Nothing needs to be rewritten. All records remain, exactly as they were saved, compatible with all other versions of their type, simply by mapping while loading.
What has to be done to achieve this? In the most common cases, nothing!
The heuristic attempts to automatically detect which fields are new, have been removed, reordered or altered.
The fields in the Contact entity class have been renamed, reordered, one was removed, one is new.
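The two versions of the Contact class are not reproduced here verbatim; the following is a sketch reconstructed only from the field names mentioned in the text (firstname, age, name / lastname, link / postalAddress):

// old version of the entity class
public class Contact
{
    String firstname;
    String name; // similar to "lastname" in the new version
    int    age;
    Object link; // removed in the new version
}

// new version of the entity class
public class Contact
{
    String        lastname;
    String        firstname;
    int           age;
    PostalAddress postalAddress; // newly added field
}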
What the heuristic is doing now is something like this:
String firstname is equal in both classes, so it has to be the same, pretty much like int age. name and lastname are pretty similar and the type is the same too; if there is nothing better for the two of them, they probably belong together. The same goes for the other two fields.
In the end, the ominous link and postalAddress remain. The heuristic cannot make sense of them, so it assumes that one falls away and the other is added. In this particular example, that worked perfectly. Well done, heuristic.
But:
Just as people can make mistakes in estimating similarities ("I would have thought ..."), programs can make mistakes as soon as they venture onto logically thin ice. The absolute correctness you know from (bug-free) software is gone. Such similarity matching will be correct in most cases, but sometimes it will get it wrong.
Example: perhaps only PostalAddress instances were referenced in the application under link, and the two fields are actually the same, only now properly typed and named. How should the heuristic know that? Nobody could know that without being privy to the details of the concrete application.
That's why Legacy Type Mapping has two mechanisms that prevent things from going wrong:
A callback interface is used to create the desired mapping result: PersistenceLegacyTypeMappingResultor
Optionally, an explicit mapping can be specified, which is then preferred to the heuristic approach.
If you do not want that, you can simply set another resultor, so that each suspected mapping is submitted once to the user for confirmation in the console. This is done with the InquiringLegacyTypeMappingResultor; the configuration snippet in the user-interaction overview at the beginning of this document shows how to plug it in.
Maybe even one where the user can "rewire" the mapping itself, write the mapping out, and then return an appropriate result.
All you need is two columns of strings: from old to new. By default MicroStream uses a CSV file, but you can also write something else. In the end, a lot of string pairs for "old -> new" mappings have to come into the program somewhere.
The concept is simple:
If there are two strings, this is interpreted as a mapping from an old thing to a new thing.
If the second value is missing, it is interpreted as an old thing to be deleted.
If the first value is missing, it is interpreted as a newly added thing.
Why call it "thing"? Because this applies to several structural elements:
Constant identifier
Class names
Field names
Example:
count;articleCount means: the field previously named count is called articleCount in the current version of the class.
count; means: the old field count should be ignored during the mapping, more specifically the values of this field in each record.
;articleCount means: this is a newly added field, DO NOT try to match it heuristically with anything else.
You can also mix explicit mapping and heuristics: only specify as many changes explicitly as are needed for the analysis to get the rest right by itself. That means you never have to specify the annoying trivial cases explicitly, only the tricky ones. Usually nothing should be necessary at all, or at most a few hints to avoid mishaps.
However, those who strictly prefer to make every change explicit, instead of trusting a "guessing" software, can also do that. No problem.
For class names, the three variants map, add and remove are somewhat tricky in meaning: map is just old -> new, same as with fields. Making an entry for a new class does not make sense; it is covered by the new class itself. You can do it, but it has no effect. Marking a removed class as deleted makes no sense either, except for one special case (see below).
It is not required to specify the field mappings of mapped classes if the heuristic can produce a correct field mapping on its own, especially if classes have only been renamed.
Classes are simply referred to by their full qualified class name:
com.my.app.entities.Order
In some cases you need to specify the exact version of the class; then the TypeId has to be prepended:
1012345:com.my.app.entities.Order
Mapping from old to new:
com.my.app.entities.Order;com.my.app.entities.OrderImplementation
For fields it's a bit more complex.
To refer to a field unambiguously, the fully qualified name of its defining class has to be used.
com.my.app.entities.Order#count;com.my.app.entities.Order#articleCount
The # is based on official Java syntax, as used e.g. in JavaDoc.
If inheritance is involved and the field must be uniquely resolved (each class in the hierarchy can have a field named count), you must also specify the declaring class. Like this:
com.my.app.entities.Order#com.my.app.entities.ArticleHolder#count;com.my.app.entities.Order#com.my.app.entities.ArticleHolder#articleCount
A simple example:
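A sketch of such a refactoring mapping file, combining the variants above; the first entry renames a class, the second renames a field, the third drops an old field, and the fourth declares a newly added field (all class and field names are placeholders):

com.my.app.entities.Order;com.my.app.entities.OrderImplementation
com.my.app.entities.Article#count;com.my.app.entities.Article#articleCount
com.my.app.entities.Article#internalCache;
;com.my.app.entities.Article#deliveryNote

To make MicroStream read such a file, it is registered on the foundation, roughly like this (a sketch; the method setRefactoringMappingProvider and the Persistence.RefactoringMapping(...) helper are assumptions about the MicroStream API and may differ between versions):

final EmbeddedStorageFoundation<?> foundation = EmbeddedStorage.Foundation();
foundation.setRefactoringMappingProvider(
    Persistence.RefactoringMapping(Paths.get("refactorings.csv")) // java.nio.file.Paths
);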
So far so good: all classes and fields are getting mapped, automatically or manually. But what about the data? How do the values get transformed from old to new? Technically speaking, it is done fully automatically. But there are some interesting questions: what happens, for example, if a field's type changed from int to float? Just copying the four bytes would yield wrong results; the value has to be converted, like float floatValue = (float)intValue;
Can it be done? Yes, fully automatically. The class BinaryValueTranslators does the job for you; it has a converter function from each primitive type to every other.
Currently MicroStream supports conversion between primitives and their wrapper types, and vice versa. When converting a wrapper to a primitive, null is converted to 0.
If you need special conversions between object types, you can add a custom BinaryValueSetter for that; see customizing.
How fast is that?
The type analysis happens only once, during initialization. If no exception occurs, the Legacy Type Mapping is ready-configured for every type that needs it and is then only invoked when required. For normal entity classes that are analyzed via reflection, loading with legacy type mapping is just as fast as a normal load: an array of value translator functions is put together once and is run through on every load. With legacy mapping, only the order and the target offsets differ, but the principle is the same as for normal loading.
For custom handlers an intermediate step is necessary: first all the old values are put together in the order the custom handler expects, then the binary data is read normally, as if a record in the current format were being loaded. That is necessary because MicroStream cannot know what such a custom handler does internally. If someone does use such a custom handler, the small detour is unlikely to be noticeable in terms of performance. And if it is, and it has a negative effect on productive operation: no problem, because you can of course also write a custom legacy type handler, which runs at full speed even for tricky special cases.
Of course there is also the possibility, as always, of intervening massively in the machinery with customizing.
If you need the highest possible performance for some cases, or logging / debugging, or anything else: register your own value translator implementations. In the simplest case this is one line of code, so do not worry. Specifying the refactoring mapping in a different way than via a CSV file is another example. You can even customize (extend or replace) the strategy by which entries are looked up in the refactoring mapping.
Furthermore, you can also replace the heuristic logic with your own. This is easier than it sounds: it is just a small interface (PersistenceMemberSimilator), and the default implementation simply applies, e.g., a Levenshtein algorithm to the names. You can certainly do that ten times more cleverly, or "more appropriately" for a particular application or programming style, e.g. by utilizing annotations.
The basic statement is: if there is a problem somewhere, whether with the heuristic, a special-case requirement, a performance problem when loading a gazillion entities all at once, or a need for in-depth debugging or anything like that: do not panic. Most likely it can be handled with a few lines of code.
Customizing examples:
More information about customizing in general:
You cannot simply mark classes as deleted. As long as there are records of a certain type in the database, the corresponding class must also exist so that the instances from the database can be loaded into the application. If there are no more records, then all that remains is a few bytes of orphaned type description in the type dictionary, which nobody cares about. You could delete it by hand (or rather not, there are good reasons against it), or you can just ignore it and leave it there forever. In both cases, you must not mark the class as deleted.
Now the special case:
In the entity graph (the root instances and all instances recursively reachable from there), all references to instances of a certain type are set to null, either by the application logic or possibly by a specially written script. That means all instances of this type are unreachable: no instance is available, and no instance can ever be loaded again. At the logical level, the type is thereby "deleted" from the database. This does not have to be registered anywhere; it is implicitly just so. You could actually delete the corresponding Java class from the application project, because it will never be needed again during loading at runtime.
So far so good.
There is only one problem: even if the instances are never logically reachable again, the data records are still lying around in the database files. The initialization scans all database files, registers all entities, collects all occurring TypeIds and ensures that there is a TypeHandler for every TypeId. If necessary a LegacyTypeHandler with mapping, but still: there must be a TypeHandler for every TypeId, and a TypeHandler needs a runtime type. So, ass-backwards, via records that are logically already deleted but still physically lying around, it is once again enforced that the supposedly deletable entity class must be present. Bummer. This can be prevented: there is a "cleanup" function in the database management logic which cleans up all logical gaps in the database files (it actually copies all non-gaps into a new file and then deletes the old file altogether). You would have to call it; then all physical occurrences of the unreachable records disappear and you could simply delete the associated class. But that is annoying.
That is why it makes sense for these cases - and only for them - to do the following:
If you as a developer are absolutely sure that not a single instance of a given class will ever be reachable again, i.e. ever has to be loaded, then you can mark the type as "deleted" (rather: "unreachable") in the refactoring mapping. The type handling will then create a dummy TypeHandler that does not need a runtime class; see PersistenceUnreachableTypeHandler. But be careful: if you are mistaken and an instance of such a type is still referenced somewhere and eventually loaded later at runtime, the unreachable handler will throw an exception, at some point during the runtime of the application and not even during initialization. Real security is provided by the cleanup: remove all logical gaps, and if no error then occurs during initialization with the class deleted, it really is superfluous.
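In the refactoring mapping file, such an "unreachable" entry simply uses the old-thing-with-no-new-counterpart form described earlier, e.g. (placeholder class name):

com.my.app.entities.ObsoleteEntity;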
Any ideas such as simply returning null in the dummy type handler instead of an instance are a fire hazard: it might resolve some annoying situations pleasantly, but it would also mean that existing records, potentially entire subgraphs, become hidden from the application. The database would nevertheless continue to drag them along, perhaps becoming inexplicably large, and any search for the reason would yield nothing, because the dummy type handler keeps the records secret. Shortsightedly great, but catastrophic in the long run. That is not good. The only clean solution is: you have to know what you are doing with your data model. As long as there are still reachable instances, they must also be loadable. The annoying special case above can be defused without side effects, but it cannot be more than that; anything more leads to chaos, problems and lost confidence in the correctness of the database solution.