Who Owns Berkeley DB?
Regardless of the technique used, it's difficult to think clearly about program architecture after code debugging begins, not to mention that large architectural changes often waste previous debugging effort. Software architecture requires a different mind set from debugging code, and the architecture you have when you begin debugging is usually the architecture you'll deliver in that release. Why architect the transactional library out of components rather than tune it to a single anticipated use?

There are three answers to this question. First, it forces a more disciplined design. Second, without strong boundaries in the code, complex software packages inevitably degenerate into unmaintainable piles of glop. Third, you can never anticipate all the ways customers will use your software; if you empower users by giving them access to software components, they will use them in ways you never considered.

In subsequent sections we'll consider each component of Berkeley DB, understand what it does and how it fits into the larger picture. The Berkeley DB access methods provide both keyed lookup of, and iteration over, variable and fixed-length byte strings. The main difference between Btree and Hash access methods is that Btree offers locality of reference for keys, while Hash does not. This implies that Btree is the right access method for almost all data sets; however, the Hash access method is appropriate for data sets so large that not even the Btree indexing structures fit into memory.

At that point, it's better to use the memory for data than for indexing structures. This trade-off made a lot more sense when main memory was typically much smaller than it is today. The difference between Recno and Queue is that Queue supports record-level locking, at the cost of requiring fixed-length values. Recno supports variable-length objects, but like Btree and Hash, supports only page-level locking.

We originally designed Berkeley DB such that the CRUD functionality (create, read, update and delete) was key-based and the primary interface for applications. We subsequently added cursors to support iteration. That ordering led to the confusing and wasteful case of largely duplicated code paths inside the library. Over time, this became unmaintainable and we converted all keyed operations to cursor operations: keyed operations now allocate a cached cursor, perform the operation, and return the cursor to the cursor pool.
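That pattern can be sketched in a few lines. The class below is a toy, not Berkeley DB's actual code: the names KeyedViaCursor and Cursor are invented for illustration, and a HashMap stands in for a real on-disk table. It shows a keyed get borrowing a cached cursor from a pool, performing the operation, and returning the cursor:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: keyed operations implemented on top of cursor
// operations, with a small cursor pool to avoid per-call allocation.
public class KeyedViaCursor {
    // A toy "cursor" over an in-memory map, standing in for a real DB cursor.
    static class Cursor {
        Map<String, String> table;
        String get(String key) { return table.get(key); }
    }

    private final Map<String, String> table = new HashMap<>();
    private final Deque<Cursor> pool = new ArrayDeque<>();

    public void put(String key, String value) { table.put(key, value); }

    // Keyed read: borrow a cached cursor, perform the operation, return it.
    public String get(String key) {
        Cursor c = pool.isEmpty() ? new Cursor() : pool.pop(); // allocate or reuse
        c.table = table;
        try {
            return c.get(key);
        } finally {
            pool.push(c); // return the cursor to the pool for the next keyed call
        }
    }

    public int pooledCursors() { return pool.size(); }
}
```

The payoff is a single code path: only the cursor implementation needs debugging, and the keyed API is a thin wrapper over it.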

This is an application of one of the endlessly-repeated rules of software development: don't optimize a code path in any way that detracts from clarity and simplicity until you know that it's necessary to do so.

Software architecture does not age gracefully. Software architecture degrades in direct proportion to the number of changes made to the software: bug fixes corrode the layering and new features stress design. Deciding when the software architecture has degraded sufficiently that you should re-design or re-write a module is a hard decision. On one hand, as the architecture degrades, maintenance and development become more difficult and at the end of that path is a legacy piece of software maintainable only by having an army of brute-force testers for every release, because nobody understands how the software works inside.

On the other hand, users will bitterly complain over the instability and incompatibilities that result from fundamental changes. As a software architect, your only guarantee is that someone will be angry with you no matter which path you choose. We omit detailed discussions of the Berkeley DB access method internals; they implement fairly well-known Btree and hashing algorithms. Recno is a layer on top of the Btree code, and Queue is a file-block lookup function, albeit complicated by the addition of record-level locking.

Over time, as we added additional functionality, we discovered that both applications and internal code needed the same top-level functionality; for example, a table join operation uses multiple cursors to iterate over the rows, just as an application might use a cursor to iterate over those same rows.

It doesn't matter how you name your variables, methods, functions, or what comments or code style you use; that is, there are a large number of formats and styles that are "good enough." Skilled programmers derive a tremendous amount of information from code format and object naming. You should view naming and style inconsistencies as some programmers investing time and effort to lie to the other programmers, and vice versa. Failing to follow house coding conventions is a firing offense.

For this reason, we decomposed the access method APIs into precisely defined layers.

These layers of interface routines perform all of the necessary generic error checking, function-specific error checking, interface tracking, and other tasks such as automatic transaction management. When applications call into Berkeley DB, they call the first level of interface routines based on methods in the object handles. One of the Berkeley DB tasks performed in the interface layer is tracking what threads are running inside the Berkeley DB library.

This is necessary because some internal Berkeley DB operations may be performed only when no threads are running inside the library. Berkeley DB tracks threads in the library by marking that a thread is executing inside the library at the beginning of every library API and clearing that flag when the API call returns. The obvious question is "why not pass a thread identifier into the library; wouldn't that be easier?"
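The enter/exit bookkeeping can be sketched with a simple counter of threads currently inside the library. This is an illustration under stated assumptions, not Berkeley DB's implementation: the ApiTracker class and its method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the entry/exit bookkeeping: every public API call
// marks the calling thread as "inside the library" and clears the mark on
// return, so maintenance operations can wait until no threads are inside.
public class ApiTracker {
    private final AtomicInteger threadsInside = new AtomicInteger();

    public int inside() { return threadsInside.get(); }

    // Wrap an API body with enter/exit marks.
    public <T> T apiCall(java.util.function.Supplier<T> body) {
        threadsInside.incrementAndGet(); // mark: this thread is inside the library
        try {
            return body.get();
        } finally {
            threadsInside.decrementAndGet(); // clear the mark on return
        }
    }

    // Some internal operations may run only when no thread is inside.
    public boolean quiesced() { return threadsInside.get() == 0; }
}
```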

But that change would have modified every single Berkeley DB application, most of every application's calls into Berkeley DB, and in many cases would have required application restructuring. Software architects must choose their upgrade battles carefully: users will accept minor changes to upgrade to new releases if you guarantee compile-time errors, that is, obvious failures until the upgrade is complete; upgrade changes should never fail in subtle ways. But to make truly fundamental changes, you must admit it's a new code base and requires a port of your user base.

Obviously, new code bases and application ports are not cheap in time or resources, but neither is angering your user base by telling them a huge overhaul is really a minor upgrade. Another task performed in the interface layer is transaction generation. The Berkeley DB library supports a mode where every operation takes place in an automatically generated transaction; this saves the application from having to create and commit its own explicit transactions.

Supporting this mode requires that every time an application calls through the API without specifying its own transaction, a transaction is automatically created. In Berkeley DB there are two flavors of error checking: generic checks to determine if our database has been corrupted during a previous operation, or if we are in the midst of a replication state change (for example, changing which replica allows writes).
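The implicit-transaction rule can be sketched as below. This is a hypothetical illustration, not Berkeley DB's interface layer: the AutoTxn and Txn names are invented, and the key point is that the library commits or aborts only the transactions it created itself.

```java
// Hypothetical sketch of automatic transaction generation: if the caller
// supplies no transaction, the interface layer creates one, commits it on
// success, and aborts it on failure.
public class AutoTxn {
    static class Txn {
        boolean committed, aborted;
        void commit() { committed = true; }
        void abort() { aborted = true; }
    }

    // Run an operation, wrapping it in an implicit transaction when needed.
    public static Txn run(Txn callerTxn, Runnable op) {
        Txn txn = (callerTxn != null) ? callerTxn : new Txn(); // auto-create
        boolean implicit = (callerTxn == null);
        try {
            op.run();
            if (implicit) txn.commit(); // only commit transactions we created
        } catch (RuntimeException e) {
            if (implicit) txn.abort();  // never abort the caller's transaction
            throw e;
        }
        return txn;
    }
}
```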

There are also checks specific to an API: correct flag usage, correct parameter usage, correct option combinations, and any other type of error we can check before actually performing the requested operation. This decomposition evolved during a period of intense activity, when we were determining precisely what actions we needed to take when working in replicated environments. After iterating over the code base some non-trivial number of times, we pulled apart all this preamble checking to make it easier to change the next time we identified a problem with it.

There are four components underlying the access methods: a buffer manager, a lock manager, a log manager and a transaction manager. We'll discuss each of them separately, but they all have some common architectural features. First, all of the subsystems have their own APIs, and initially each subsystem had its own object handle with all methods for that subsystem based on the handle.

For example, you could use Berkeley DB's lock manager to handle your own locks or to write your own remote lock manager, or you could use Berkeley DB's buffer manager to handle your own file pages in shared memory. This architectural feature enforces layering and generalization.

Even though the layer moves from time-to-time, and there are still a few places where one subsystem reaches across into another subsystem, it is good discipline for programmers to think about the parts of the system as separate software products in their own right.

Second, all of the subsystems (in fact, all Berkeley DB functions) return error codes up the call stack. As a library, Berkeley DB cannot step on the application's name space by declaring global variables, not to mention that forcing errors to return in a single path through the call stack enforces good programmer discipline.

In library design, respect for the namespace is vital. Programmers who use your library should not need to memorize dozens of reserved names for functions, constants, structures, and global variables to avoid naming collisions between an application and the library.

Finally, all of the subsystems support shared memory. Because Berkeley DB supports sharing databases between multiple running processes, all interesting data structures have to live in shared memory.

The most significant implication of this choice is that in-memory data structures must use base address and offset pairs instead of pointers in order for pointer-based data structures to work in the context of multiple processes. In other words, instead of indirecting through a pointer, the Berkeley DB library must create a pointer from a base address (the address at which the shared memory segment is mapped into memory) plus an offset (the offset of a particular data structure in that mapped-in segment).
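The idea can be demonstrated with a toy linked list that stores offsets rather than references. As a stand-in for a mapped shared-memory segment (this sketch and the OffsetList name are illustrative, not Berkeley DB code), a byte buffer plays the role of the segment; every "pointer" is an offset resolved against the segment's base, so the structure is valid at whatever address each process happens to map it:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of base-plus-offset addressing: a "shared segment" is a
// byte buffer, and linked structures store offsets into it rather than
// pointers, so the structure works regardless of where the segment is mapped.
public class OffsetList {
    static final int NIL = -1;

    // Each node is 8 bytes: a 4-byte value, then the 4-byte offset of the next node.
    static int addNode(ByteBuffer segment, int at, int value, int nextOffset) {
        segment.putInt(at, value);
        segment.putInt(at + 4, nextOffset);
        return at;
    }

    // Traverse by repeatedly resolving offset -> position within the segment.
    static int sum(ByteBuffer segment, int headOffset) {
        int total = 0;
        for (int off = headOffset; off != NIL; off = segment.getInt(off + 4))
            total += segment.getInt(off);
        return total;
    }
}
```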

To support this feature, we wrote a version of the Berkeley Software Distribution queue package that implemented a wide variety of linked lists. Before we wrote a shared-memory linked-list package, Berkeley DB engineers hand-coded a variety of different data structures in shared memory, and these implementations were fragile and difficult to debug.

The shared-memory list package was modeled after the BSD queue package. Once it was debugged, we never had to debug another shared-memory linked-list problem. This illustrates three important design principles. First, if you have functionality that appears more than once, write the shared functions and use them, because the mere existence of two copies of any specific functionality in your code guarantees that one of them is incorrectly implemented.

Second, when you develop a set of general purpose routines, write a test suite for the set of routines, so you can debug them in isolation. Third, the harder code is to write, the more important for it to be separately written and maintained; it's almost impossible to keep surrounding code from infecting and corroding a piece of code.

The Berkeley DB Mpool subsystem is an in-memory buffer pool of file pages, which hides the fact that main memory is a limited resource, requiring the library to move database pages to and from disk when handling databases larger than memory. Caching database pages in memory was what enabled the original hash library to significantly out-perform the historic hsearch and ndbm implementations. The advantage of this representation is that a page can be flushed from the cache without format conversion; the disadvantage is that traversing an index structure requires costlier repeated buffer pool lookups rather than cheaper memory indirections.

There are other performance implications that result from the underlying assumption that the in-memory representation of Berkeley DB indices is really a cache for on-disk persistent data. For example, whenever Berkeley DB accesses a cached page, it first pins the page in memory. This pin prevents any other threads or processes from evicting it from the buffer pool. Even if an index structure fits entirely in the cache and need never be flushed to disk, Berkeley DB still acquires and releases these pins on every access, because the underlying model provided by Mpool is that of a cache, not persistent storage.

Mpool assumes it sits atop a filesystem, exporting the file abstraction through the API. The get and put methods are the primary Mpool APIs: get ensures a page is present in the cache, acquires a pin on the page and returns a pointer to the page.

When the library is done with the page, the put call unpins the page, releasing it for eviction. Early versions of Berkeley DB did not differentiate between pinning a page for read access versus pinning a page for write access. However, in order to increase concurrency, we extended the Mpool API to allow callers to indicate their intention to update a page. This ability to distinguish read access from write access was essential to implement multi-version concurrency control.

A page pinned for reading that happens to be dirty can be written to disk, while a page pinned for writing cannot, since it may be in an inconsistent state at any instant.
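The pin protocol can be sketched as follows. This MiniMpool class is a hypothetical illustration, not the real Mpool API: it shows get pinning a page, put releasing the pin, and the rule that a dirty page pinned for writing must not be flushed while one pinned only for reading may be.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of Mpool-style get/put: get pins a page in the cache,
// put unpins it, and a page may be flushed only under the rules below.
public class MiniMpool {
    static class Page {
        int pins;              // number of callers currently holding the page
        boolean dirty;         // modified since last written to disk
        boolean pinnedForWrite;
        byte[] data = new byte[4096];
    }

    private final Map<Integer, Page> cache = new HashMap<>();

    // get: ensure the page is present in the cache, pin it, return it.
    public Page get(int pageNo, boolean forWrite) {
        Page p = cache.computeIfAbsent(pageNo, n -> new Page());
        p.pins++;
        if (forWrite) p.pinnedForWrite = true;
        return p;
    }

    // put: release the pin, making the page a candidate for eviction.
    public void put(int pageNo) {
        Page p = cache.get(pageNo);
        p.pins--;
        if (p.pins == 0) p.pinnedForWrite = false;
    }

    // A dirty page pinned for reading may still be written out; one pinned
    // for writing may be internally inconsistent and must not be.
    public boolean canWriteOut(int pageNo) {
        Page p = cache.get(pageNo);
        return p.dirty && !p.pinnedForWrite;
    }
}
```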

Berkeley DB uses write-ahead logging (WAL) as its transaction mechanism to make recovery after failure possible. The term write-ahead logging describes a policy requiring that log records describing any change be propagated to disk before the actual data updates they describe.

Berkeley DB's use of WAL as its transaction mechanism has important implications for Mpool, and Mpool must balance its design point as a generic caching mechanism with its need to support the WAL protocol. Berkeley DB writes log sequence numbers (LSNs) on all data pages to document the log record corresponding to the most recent update to a particular page.

Enforcing WAL requires that before Mpool writes any page to disk, it must verify that the log record corresponding to the LSN on the page is safely on disk. The design challenge is how to provide this functionality without requiring that all clients of Mpool use a page format identical to that used by Berkeley DB. Mpool addresses this challenge by providing a collection of set and get methods to direct its behavior. If the method is never called, Mpool does not enforce the WAL protocol.
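The check itself is small; the sketch below (the WalGate name and methods are invented for illustration, not the real API) shows the buffer pool asking, before writing a page, whether the log has been flushed through that page's LSN, and forcing a log flush when it has not:

```java
// Hypothetical sketch of WAL enforcement at the cache boundary: before a
// dirty page goes to disk, the log must be flushed through that page's LSN.
public class WalGate {
    private long logFlushedThrough; // highest LSN known to be safely on disk

    public WalGate(long flushedLsn) { this.logFlushedThrough = flushedLsn; }

    // Stand-in for flushing the log up to (and including) the given LSN.
    public void flushLog(long lsn) {
        if (lsn > logFlushedThrough) logFlushedThrough = lsn;
    }

    // Called by the buffer pool before writing a page whose most recent
    // update is described by pageLsn; returns true if a log flush was needed.
    public boolean ensureWalBeforePageWrite(long pageLsn) {
        if (pageLsn > logFlushedThrough) {
            flushLog(pageLsn); // write-ahead: the log record must reach disk first
            return true;
        }
        return false;
    }

    public long flushedThrough() { return logFlushedThrough; }
}
```

A cache client that never registers page LSNs simply never triggers this check, which is how Mpool stays useful to non-transactional callers.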

These APIs allow Mpool to provide the functionality necessary to support Berkeley DB's transactional requirements, without forcing all users of Mpool to do so. Write-ahead logging is another example of providing encapsulation and layering, even when the functionality is never going to be useful to another piece of software: after all, how many programs care about LSNs in the cache?

Regardless, the discipline is useful and makes the software easier to maintain, test, debug and extend. Like Mpool, the lock manager was designed as a general-purpose component: a hierarchical lock manager (see [GLPT76]), designed to support a hierarchy of objects that can be locked, such as individual data items, the page on which a data item lives, the file in which a data item lives, or even a collection of files.

As we describe the features of the lock manager, we'll also explain how Berkeley DB uses them. However, as with Mpool, it's important to remember that other applications can use the lock manager in completely different ways, and that's OK—it was designed to be flexible and support many different uses.

Lockers are 32-bit unsigned integers. Berkeley DB divides this 32-bit name space into transactional and non-transactional lockers, although that distinction is transparent to the lock manager. When Berkeley DB uses the lock manager, it assigns locker IDs in the range 0 to 0x7fffffff to non-transactional lockers and the range 0x80000000 to 0xffffffff to transactions. For example, when an application opens a database, Berkeley DB acquires a long-term read lock on that database to ensure no other thread of control removes or renames it while it is in use.

As this is a long-term lock, it does not belong to any transaction and the locker holding this lock is non-transactional. So applications need not implement their own locker ID allocator, although they certainly can. Lock objects are arbitrarily long opaque byte-strings that represent the objects being locked.

When two different lockers want to lock a particular object, they use the same opaque byte string to reference that object. That is, it is the application's responsibility to agree on conventions for describing objects in terms of opaque byte strings.

This structure contains three fields: a file identifier, a page number, and a type. In almost all cases, Berkeley DB needs to describe only the particular file and page it wants to lock. Berkeley DB assigns a unique 32-bit number to each database at create time, writes it into the database's metadata page, and then uses it as the database's unique identifier in the Mpool, locking, and logging subsystems.
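Because lock objects are opaque byte strings, two lockers name the same object by producing identical bytes. A minimal sketch of such an encoding (the LockObject class and its 9-byte layout are assumptions for illustration, not Berkeley DB's on-the-wire format):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: lock objects are opaque byte strings, so two lockers
// naming the same (file, page, type) triple must produce identical bytes.
public class LockObject {
    // Encode a (file id, page number, type) triple as an opaque byte string.
    static byte[] encode(int fileId, int pageNo, byte type) {
        return ByteBuffer.allocate(9)
                .putInt(fileId)
                .putInt(pageNo)
                .put(type)
                .array();
    }

    // The lock table only ever compares the bytes; it never interprets them.
    static boolean sameObject(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }
}
```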

Not surprisingly, the page number indicates which page of the particular database we wish to lock. However, we can also lock other types of objects as necessary. Berkeley DB's choice to use page-level locking was made for good reasons, but we've found that choice to be problematic at times. Page-level locking limits the concurrency of the application as one thread of control modifying a record on a database page will prevent other threads of control from modifying other records on the same page, while record-level locks permit such concurrency as long as the two threads of control are not modifying the same record.

Page-level locking enhances stability as it limits the number of possible recovery paths (a page is always in one of a couple of states during recovery, as opposed to the infinite number of possible states a page might be in if multiple records are being added to and deleted from a page).

As Berkeley DB was intended for use as an embedded system where no database administrator would be available to fix things should there be corruption, we chose stability over increased concurrency. The last abstraction of the locking subsystem we'll discuss is the conflict matrix.

A conflict matrix defines the different types of locks present in the system and how they interact. Let's call the entity holding a lock the holder and the entity requesting a lock the requester, and let's also assume that the holder and requester have different locker IDs. The conflict matrix is an array indexed by [requester][holder], where each entry contains a zero if there is no conflict, indicating that the requested lock can be granted, and a one if there is a conflict, indicating that the request cannot be granted.
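For the classic two-mode case of shared (read) and exclusive (write) locks, the matrix looks like this. The sketch is illustrative, not Berkeley DB's default matrix, which defines more modes:

```java
// Hypothetical sketch of a conflict matrix for two lock modes, shared (S)
// and exclusive (X): MATRIX[requester][holder] == 1 means "cannot grant".
public class ConflictMatrix {
    static final int NONE = 0, SHARED = 1, EXCLUSIVE = 2;

    // Rows are the requested mode, columns the held mode.
    static final int[][] MATRIX = {
        //             holder: NONE  SHARED  EXCLUSIVE
        /* req NONE      */  { 0,    0,      0 },
        /* req SHARED    */  { 0,    0,      1 }, // S conflicts only with X
        /* req EXCLUSIVE */  { 0,    1,      1 }, // X conflicts with S and X
    };

    static boolean canGrant(int requested, int held) {
        return MATRIX[requested][held] == 0;
    }
}
```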

The lock manager contains a default conflict matrix, which happens to be exactly what Berkeley DB needs; however, an application is free to design its own lock modes and conflict matrix to suit its own purposes.

Table 2 shows how these features, including failover recovery support, are distributed between the different BDB family members. All editions of Berkeley DB are freely available for download and can be used in open source products which are not distributed to third parties.

A commercial license is necessary for using any of the BDB editions in a closed-source, packaged product. When closing an Environment or Database, or when committing a Transaction in a multithreaded application, we should ensure that no thread still has in-progress tasks. To create an in-memory database we can use DatabaseConfig. The environment path should point to an already existing directory; otherwise the application will face an exception.

When we create an environment object for the first time, the necessary files are created inside that directory. Java annotations are used to define metadata such as relations between objects. Field refactoring is supported without changing the stored data (called mutation). Table 6 and Table 7 list the features that mostly determine when we should use which API.

Environment, Database, and EntityStore are thread-safe, meaning that we can use them in multiple threads without manual synchronization. Once a transaction is committed, the transaction handle is no longer valid and a new transaction object is required for further transactional activities. The persistence annotations are:

PrimaryKey: Defines the class's primary key; must be used exactly once in every entity class.

SecondaryKey: Declares a specific data member in an entity class to be a secondary key for that object. This annotation is optional and can be used multiple times in an entity class.

Persistent: Declares a persistent class which lives in relation to an entity class.

NotTransient: Defines a field as persistent even when it is declared with the transient keyword.

NotPersistent: Defines a field as non-persistent even when it is not declared with the transient keyword.

KeyField: Indicates the sorting position of a key field in a composite key class when the Comparable interface is not implemented. The KeyField integer element specifies the sort order of this field within the set of fields in the composite key.

Multiple processes can open a database as long as only one process opens it in read-write mode and the other processes open it in read-only mode. To stay compatible with the Java Collections API, transactions are supported through TransactionWorker and TransactionRunner: the former is the interface we implement to execute our code in a transaction, and the latter runs that work as a transaction.
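The worker/runner split can be sketched without the JE library at all. The MiniRunner class below is only modeled on that pattern (its names and retry behavior are assumptions, not the com.sleepycat API): the worker carries the application code, and the runner executes it, retrying on failure as the real TransactionRunner does for deadlocks.

```java
// Hypothetical sketch of the TransactionRunner/TransactionWorker split:
// the worker carries the application code, and the runner executes it,
// retrying on conflict-style failures. Names only echo the real API.
public class MiniRunner {
    interface Worker { void doWork() throws Exception; }

    static class RetriesExhausted extends RuntimeException {}

    private final int maxRetries;
    public MiniRunner(int maxRetries) { this.maxRetries = maxRetries; }

    // Run the worker, retrying up to maxRetries times on failure.
    public int run(Worker worker) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                worker.doWork(); // in the real API this runs inside a transaction
                return attempt;  // number of attempts used before success
            } catch (Exception e) {
                // on failure the transaction would be aborted and retried
            }
        }
        throw new RetriesExhausted();
    }
}
```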

Keys and values are represented as Java objects. Custom binding can be defined to bind the stored bytes to any type or format like XML, for example. Data binding should be defined to instruct the Collections API about how keys and values are represented as stored data and how stored data is converted to and from Java objects.

We can use one of the two default data bindings, SerialBinding and TupleBinding, or a custom data binding. The Collections API extends Java serialization to store class descriptions separately, making data records much more compact. Backup and Recovery: We can back up the BDB databases simply by creating an operating-system-level copy of all jdb files.

Tuning: Berkeley DB JE has three daemon threads, and configuring these threads affects overall application performance and behavior:

Cleaner Thread: Responsible for cleaning and deleting unused log files. This thread runs only if the environment is opened for write access.

Checkpointer Thread: Basically keeps the Btree shape consistent. The checkpointer thread is triggered when the environment opens, when the environment closes, and when the database log file grows by a certain amount.

Compressor Thread: Cleans unused nodes out of the Btree structure.

If these checks are performed little by little, application startup is faster but more resources, especially I/O, are consumed. To determine the ideal cache size we should put the application in the production environment and monitor its behavior.

Helper Utilities: Three command-line utilities are provided to facilitate dumping the databases from one environment, verifying the database structure, and loading the dump into another environment:

DbDump: Dumps a database to a user-readable format.

DbLoad: Loads a database from the DbDump output.

DbVerify: Verifies the structure of a database.

To run each of these utilities, switch to the BDB JE directory, then to the lib directory, and execute as shown in the following command: java -cp je

Several examples for different sets of functionality are provided inside the examples directory of the BDB JE package.

Until now, developers have not had to worry much about compliance with the terms of the license because they never "redistributed" the source of their Web apps -- the apps were simply run on servers and accessed remotely by users.

But the terms of the AGPL additionally stipulate that remote usage of the software becomes a trigger for license compliance.

First, they now need to make full corresponding source to their Web application available. Second, they need to ensure the full app -- previously considered an internal-use asset -- has compatible and compliant licensing.

Oracle provided no rationale for the licensing change, but it may well be intended as a spur to further proprietary licensing. But there are alternatives.

Additionally, there are numerous other embedded databases, although many are SQL-based rather than simple key-value store databases.

Oracle is entirely within its rights to change the license without warning; the company owns the full copyright to the code. But many will view the change as a hostile act intended to force them into a proprietary licensing relationship with Oracle. I can't help but think this betrayal of trust will drive adoption of alternatives instead.


