Wednesday 4 February 2009

A custom namespacing system for Python, part 2

This post continues on from A custom namespacing system for Python, part 1.

Disclaimer

Feedback on these custom namespacing system posts has been almost all negative. Comments range from suggesting I use __import__ instead, to suggesting that doing this is outright wrong.

One assumption seems to be that I do not know about __import__, which is incorrect. Another seems to be a disbelief that there should be any attempt to do something different in Python, like this for instance. Another might be that I use something like this because I don't like using the standard system which comes with Python. I can't guess what anyone reads into a blog post any more than a reader can guess why anyone would document the implementation of a system like this.

To me, one of the best aspects of Python programming is its ability to do meta-programming: things like the package system are not forced on you, and you have the freedom and flexibility to create something like this.

Making the system more usable

In any case, the goal of the system was that all files within a given directory contributed the objects created within them to the same namespace, where the namespace was to match the directory hierarchy. However, in order to be usable, a system like this requires some additional functionality.

Namely:

  • Dependency resolution.
  • Intelligent filtering of namespace elements.

Dependency resolution

The current version of LoadScript in the ScriptDirectory class looks like this:
def LoadScript(self, filePath, namespacePath):
    scriptFile = ScriptFile(filePath)
    namespace = self.CreateNamespace(namespacePath)
    self.InsertModuleAttributes(scriptFile.scriptGlobals, namespace)

    return scriptFile
The problem is that loading a set of scripts in this way prevents dependencies existing between them. This is an unrealistic constraint. For any reasonably complex set of scripts, there are going to be dependencies, and one of the most common cases will be classes defined in one script subclassing classes defined in another. Dependency resolution is required for this system to be usable.

If the execution related aspects are removed from the loading of scripts, then all scripts can be prepared before any are executed. The next step is then to do all the execution as a batch, with dependencies resolved as part of the process.

So LoadScript needs to be broken into two parts. The new version of LoadScript should be limited to loading the code and the execution related aspects can be put into a new function called RunScript.
def LoadScript(self, filePath, namespacePath):
    return ScriptFile(filePath, namespacePath)

def RunScript(self, scriptFile):
    scriptFile.Run()

    namespace = self.CreateNamespace(scriptFile.namespacePath)
    self.InsertModuleAttributes(scriptFile.scriptGlobals, namespace)
As LoadScript delegates the actual loading and execution to the ScriptFile class, this needs to be split up in the same way.

The current version of Load in the ScriptFile class looks like this:
def Load(self, filePath):
    self.filePath = filePath

    script = open(self.filePath, 'r').read()
    self.codeObject = compile(script, self.filePath, "exec")

    self.scriptGlobals = {}
    eval(self.codeObject, self.scriptGlobals, self.scriptGlobals)
This needs to be broken into two parts in the same way: a Load function to read in and compile the script file's source code, and a Run function to attempt to execute the resulting compiled code.

However, the dependency resolution process needs to track the files which failed to run. And if some script files have dependencies which can never be located, preventing the startup process from completing, then knowing what those files were trying to import is essential for any programmer using this system to work out what they did wrong. So we will handle both these aspects by returning a flag to indicate success and, on failure, storing information about the import failure.
# These imports belong at the top of the module.
import sys
import traceback

def Load(self, filePath):
    self.filePath = filePath

    script = open(self.filePath, 'r').read()
    self.codeObject = compile(script, self.filePath, "exec")

def Run(self):
    self.scriptGlobals = {}
    try:
        eval(self.codeObject, self.scriptGlobals, self.scriptGlobals)
    except ImportError:
        # Keep the formatted traceback so it can be reported later.
        self.lastError = traceback.format_exception(*sys.exc_info())
        return False

    return True
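To try the Load and Run split on its own, here is a minimal sketch of a ScriptFile-like class. It is written for Python 3 rather than the Python 2 used above, and the snake_case names, the constructor and the initial last_error value are placeholders for illustration, not the real class.

```python
import sys
import traceback


class ScriptFile(object):
    """Illustrative stand-in: compiles on construction, executes on demand."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.script_globals = None
        self.last_error = None
        self.load()

    def load(self):
        # Read and compile only. Nothing in the script is executed yet.
        with open(self.file_path, "r") as script_file:
            source = script_file.read()
        self.code_object = compile(source, self.file_path, "exec")

    def run(self):
        # Execute in a fresh dictionary so a failed attempt leaves no residue.
        self.script_globals = {}
        try:
            exec(self.code_object, self.script_globals)
        except ImportError:
            # Remember why it failed, so the caller can report it later.
            self.last_error = traceback.format_exception(*sys.exc_info())
            return False
        return True
```

Calling run() again after a failure simply re-executes the compiled code, which is what allows a later pass to succeed once the missing dependency exists.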
The RunScript which was rewritten above will also need to be changed again to return the success flag up to its caller, but this is a simple change.
def RunScript(self, scriptFile):
    if not scriptFile.Run():
        return False

    namespace = self.CreateNamespace(scriptFile.namespacePath)
    self.InsertModuleAttributes(scriptFile.scriptGlobals, namespace)

    return True
The next step is to rewrite the Load function in the ScriptDirectory class. Previously, it was enough to just load all the script files, executing them as part of the process.
def Load(self):
    self.LoadDirectory(self.baseDirPath)
Now the two distinct steps need to be handled. The start of Load remains the same, as that now only handles the loading. But the second step of executing the loaded script files while resolving the encountered dependencies needs to follow it.

This can be done in a simple manner with a straightforward algorithm.
  1. Make a list of all the known script files.
  2. Try and execute each script file in the list one by one.
    • If a script file is executed successfully, remove it from the list.
  3. Note that one more attempt has been made to execute all the remaining scripts.
  4. If more than a reasonable number of attempts have been made, give up.
  5. Otherwise, go back to step 2.
Or as implemented in Python code.
scriptFilesToLoad = set(self.filesByPath.itervalues())
attemptsLeft = self.dependencyResolutionPasses
while len(scriptFilesToLoad) and attemptsLeft > 0:
    scriptFilesLoaded = set()
    for scriptFile in scriptFilesToLoad:
        if self.RunScript(scriptFile):
            scriptFilesLoaded.add(scriptFile)

    # Update the set of scripts which have yet to be loaded.
    scriptFilesToLoad -= scriptFilesLoaded

    attemptsLeft -= 1
If this loop exits with scripts remaining to be loaded, then the loading process has failed, and the user should be notified so they can fix their errors, circular dependencies or whatever else they may have done wrong. Each script file will have recorded the error that occurred when it was last executed, so that information can be relayed to the user.
if len(scriptFilesToLoad):
    logging.error("ScriptDirectory.Load failed to resolve dependencies")

    # Log information about the problematic script files.
    for scriptFile in scriptFilesToLoad:
        scriptFile.LogLastError()
The LogLastError function is also rather straightforward.
def LogLastError(self, flush=True):
    if self.lastError is None:
        logging.error("Script file '%s' unexpectedly missing a last error", self.filePath)
        return

    logging.error("Script file '%s'", self.filePath)
    for line in self.lastError:
        logging.error("%s", line.rstrip("\r\n"))

    if flush:
        self.lastError = None
The function which created the ScriptDirectory instance and asked it to load also needs to be able to tell that the process failed. Returning a success flag finishes off Load.
def Load(self):
    ## Pass 1: Load all the valid scripts under the given directory.
    self.LoadDirectory(self.baseDirPath)

    ## Pass 2: Execute the scripts, ordering for dependencies and then add the namespace entries.
    scriptFilesToLoad = set(self.filesByPath.itervalues())
    attemptsLeft = self.dependencyResolutionPasses
    while len(scriptFilesToLoad) and attemptsLeft > 0:
        scriptFilesLoaded = set()
        for scriptFile in scriptFilesToLoad:
            if self.RunScript(scriptFile):
                scriptFilesLoaded.add(scriptFile)

        # Update the set of scripts which have yet to be loaded.
        scriptFilesToLoad -= scriptFilesLoaded

        attemptsLeft -= 1

    if len(scriptFilesToLoad):
        logging.error("ScriptDirectory.Load failed to resolve dependencies")

        # Log information about the problematic script files.
        for scriptFile in scriptFilesToLoad:
            scriptFile.LogLastError()

        return False

    return True
And with the addition of dependency resolution support, the custom namespacing solution is now usable. However, with a sufficiently complex set of scripts, the algorithm may not be sufficient: a chain of dependencies longer than the permitted number of passes, for example, may not be resolved in time. But that is an easy problem for future users to solve.
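The resolution loop can also be tried in isolation. The following is a minimal sketch, written for Python 3 rather than the Python 2 used above. It assumes the scripts are given as source strings keyed by a hypothetical module name, and that a script which runs successfully is published straight into sys.modules so that later scripts can import it. run_with_retries and its parameters are illustrative names, not part of ScriptDirectory.

```python
import sys
import types


def run_with_retries(sources, max_passes=3):
    """Execute each source until it stops raising ImportError.

    Returns the set of names which never ran successfully.
    """
    pending = set(sources)
    passes_left = max_passes
    while pending and passes_left > 0:
        finished = set()
        # Reverse sorting only makes the demonstration order deterministic.
        for name in sorted(pending, reverse=True):
            script_globals = {}
            try:
                exec(compile(sources[name], name, "exec"), script_globals)
            except ImportError:
                continue  # A dependency is not there yet, retry next pass.

            # Publish the result so that other scripts can import it.
            module = types.ModuleType(name)
            for key, value in script_globals.items():
                if key != "__builtins__":
                    setattr(module, key, value)
            sys.modules[name] = module
            finished.add(name)

        pending -= finished
        passes_left -= 1
    return pending


sources = {
    "demo_ns_base": "class Base(object): pass\n",
    "demo_ns_derived": "from demo_ns_base import Base\n"
                       "class Derived(Base): pass\n",
}
unresolved = run_with_retries(sources)
```

Here demo_ns_derived is attempted first and fails, demo_ns_base then succeeds, and the second pass picks up demo_ns_derived. A script whose dependency never appears stays in the returned set.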

Intelligent filtering of namespace elements

A script file can be looked at as containing two different sets of objects: those which were imported from elsewhere and those which were created within the script file. Only the latter set should be exported to the namespace the file contributes to. The former should be filtered out.

In an ideal world, there would be some way of determining what was actually created locally. But in this world, it is only possible to identify certain kinds of externally sourced objects.

Modules are one of the most commonly imported types of objects. These are never created within a script file, so they can always be filtered out.

if type(v) is types.ModuleType:
    continue
Classes are another commonly imported type of object, and it is simple for us to distinguish between the ones which were created locally and the ones which weren't. The script globals dictionary has no __name__ entry, so the lookup falls through to the builtins and a class created locally ends up with a __module__ attribute of "__builtin__". A class imported from somewhere else will already have had it set to something else.
if type(v) in (types.ClassType, types.TypeType):
    if v.__module__ != "__builtin__":
        continue

    v.__module__ = moduleName
The kinds of objects which cannot be filtered out are those which have values that are simple types like strings, numbers and so forth.
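The filtering rules above can be sketched as a standalone function. This is a Python 3 approximation rather than the code from the post: Python 3 has no types.ClassType, and the builtins module is named "builtins" instead of "__builtin__". select_exports is a made-up name used only for this example.

```python
import types


def select_exports(script_globals, module_name):
    """Return the entries of script_globals which look locally created."""
    exports = {}
    for key, value in script_globals.items():
        if key == "__builtins__":
            continue

        # Modules are never created by a script, so they are always imports.
        if isinstance(value, types.ModuleType):
            continue

        if isinstance(value, type):
            # A class made by the script has no real module yet. Anything
            # else was imported and already knows where it came from.
            if value.__module__ != "builtins":
                continue
            value.__module__ = module_name

        # Simple values cannot be told apart, so they always pass through.
        exports[key] = value
    return exports


script_globals = {}
exec("import os\n"
     "from collections import OrderedDict\n"
     "class Local(object): pass\n"
     "ANSWER = 42\n", script_globals)
exports = select_exports(script_globals, "game.things")
```

After this runs, exports holds only Local and ANSWER. The os module and the imported OrderedDict class are dropped.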

A runnable form of the code shown above can be found here.

The followup to this post can be found here.

Edit: Added a link to the next post in the set.

Monday 2 February 2009

A custom namespacing system for Python, part 1

The standard Python package system

The Python module system works in a certain defined way. An "__init__.py" file needs to be placed in a module directory. Other Python files within the module directory need to be imported by name, and their contents are available under that name.

Let's say there's a directory named "mymodule" with the compulsory "__init__.py" in it. There's also a "myfile.py" there and it has a class MyClass within it. In order to access MyClass, mymodule.myfile needs to be imported and then the class can be accessed as an attribute on it.

import mymodule.myfile
myInstance = mymodule.myfile.MyClass()
Boilerplate code can be placed within "__init__.py" to make MyClass accessible from mymodule directly. One way this can be done is by importing myfile and getting MyClass from it.
# The contents of __init__.py
from myfile import MyClass
Now MyClass can be imported directly from mymodule.
from mymodule import MyClass
myInstance = MyClass()

import mymodule
myInstance = mymodule.MyClass()
The Python documentation calls this system of structuring module directories packages.

But sometimes you want something simpler. A way of placing code in files which is directly importable in a manner you might prefer. Me, often I don't care what files are in a directory, I just want their contents lumped in a straightforward namespace which maps to the directory structure.

A directory based namespacing system

One way to go about implementing a custom system like this would be to specify a base module name and a script directory, with the intent that the contents of that directory get placed into a namespace under that base module. This is the system I am going to implement.

Here's an example script directory which would be used with this system:
scripts/fileA.py
scripts/things/fileB.py
scripts/things/fileC.py
"fileA.py" contains the following code:
class Alpha: pass
"fileB.py" contains the following code:
class Beta: pass
"fileC.py" contains the following code:
class Gamma: pass
Given a base module name of 'game', this would generate the following namespace hierarchy:
game
game.Alpha
game.things
game.things.Beta
game.things.Gamma
The module 'game' contains the class Alpha from "fileA.py", and the module 'game.things' contains the class Beta from "fileB.py" and the class Gamma from "fileC.py".

The first step is to write a function to recursively walk through a directory tree.
def LoadDirectory(dirPath):
    for entryName in os.listdir(dirPath):
        entryPath = os.path.join(dirPath, entryName)
        if os.path.isdir(entryPath):
            LoadDirectory(entryPath)
        elif os.path.isfile(entryPath):
            if entryName.endswith(".py"):
                LoadScript(entryPath)
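The walk itself can be checked without loading anything. The following Python 3 sketch yields the script paths instead of calling LoadScript on them. find_scripts is a hypothetical helper, and the sorting is only there to make the order predictable.

```python
import os


def find_scripts(dir_path):
    """Yield the path of every ".py" file under dir_path, depth first."""
    for entry_name in sorted(os.listdir(dir_path)):
        entry_path = os.path.join(dir_path, entry_name)
        if os.path.isdir(entry_path):
            # Recurse into the subdirectory and pass its results along.
            for script_path in find_scripts(entry_path):
                yield script_path
        elif os.path.isfile(entry_path) and entry_name.endswith(".py"):
            yield entry_path
```

For the example directory shown earlier, list(find_scripts("scripts")) would give the paths of "fileA.py", "fileB.py" and "fileC.py", in that order.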
The script loading should read in the given Python script.
script = open(filePath, 'r').read()
Compile it as an executable sequence of statements.
codeObject = compile(script, filePath, "exec")
Execute it to generate the objects the code defines.
scriptGlobals = {}
eval(codeObject, scriptGlobals, scriptGlobals)
Giving the LoadScript function.
def LoadScript(filePath):
    script = open(filePath, 'r').read()
    codeObject = compile(script, filePath, "exec")
    scriptGlobals = {}
    eval(codeObject, scriptGlobals, scriptGlobals)
    return scriptGlobals
Now when the Python script is compiled and executed, a global dictionary scriptGlobals is supplied. The script executes using that global dictionary, and anything the script assigns ends up as an entry within it. Given the namespace object the script file was supposed to contribute to, all these variables created by it could be placed in that namespace and we'd be done.
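To see what ends up in that dictionary, here is a small Python 3 rendition of the same function. It is a sketch rather than a drop-in replacement: it uses exec instead of eval and closes the file explicitly, and load_script is just an illustrative name.

```python
def load_script(file_path):
    """Compile and execute a script, returning the globals it produced."""
    with open(file_path, "r") as script_file:
        source = script_file.read()

    # Passing the file path to compile makes tracebacks point at the file.
    code_object = compile(source, file_path, "exec")

    script_globals = {}
    exec(code_object, script_globals)
    return script_globals
```

Running this on "fileA.py" from the example returns a dictionary containing Alpha, plus the __builtins__ entry which Python adds by itself when the code is executed.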

So given an absolute script path entryPath, that script would be loaded.
scriptGlobals = LoadScript(entryPath)
And given a namespace path namespacePath (like for instance 'game.things'), the corresponding module object could be obtained or created if needed.
module = GetNamespaceModule(namespacePath)
Then the attributes from the script globals could be transferred over to the namespace module object.
InsertModuleAttributes(scriptGlobals, module)
The namespace path is easily obtained. It is the same for all scripts located within the same directory. The base namespace name is combined with the relative path of the script file from the base script directory to make it.

Given the base script directory of "scripts" and the base namespace name of 'game', the file "scripts/fileA.py" has the namespace path of 'game'. And the file "scripts/things/fileB.py" has the namespace path of 'game.things'.

So building the namespace path for a given directory can be done like this.
namespacePath = baseNamespacePath
relativeDirPath = os.path.relpath(dirPath, baseDirPath)
if relativeDirPath != ".":
    namespacePath += "."+ relativeDirPath.replace(os.path.sep, ".")
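Wrapped up as a function, the same calculation is easy to check against the example paths. This is a sketch with made-up names. It takes the base values as parameters instead of reading them from globals.

```python
import os


def namespace_path_for(dir_path, base_dir_path, base_namespace_path):
    """Map a directory under base_dir_path to a dotted namespace path."""
    relative_dir_path = os.path.relpath(dir_path, base_dir_path)
    if relative_dir_path == os.curdir:
        # The base directory itself maps to the base namespace.
        return base_namespace_path
    return base_namespace_path + "." + relative_dir_path.replace(os.path.sep, ".")


top = namespace_path_for("scripts", "scripts", "game")
nested = namespace_path_for(os.path.join("scripts", "things"), "scripts", "game")
```

Here top is 'game' and nested is 'game.things', matching the namespace paths given for "fileA.py" and "fileB.py".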
Combining all the pieces so far results in LoadDirectory looking something like the following.
def LoadDirectory(dirPath):
    namespacePath = baseNamespacePath
    relativeDirPath = os.path.relpath(dirPath, baseDirPath)
    if relativeDirPath != ".":
        namespacePath += "."+ relativeDirPath.replace(os.path.sep, ".")

    for entryName in os.listdir(dirPath):
        entryPath = os.path.join(dirPath, entryName)
        if os.path.isdir(entryPath):
            LoadDirectory(entryPath)
        elif os.path.isfile(entryPath):
            if entryName.endswith(".py"):
                scriptGlobals = LoadScript(entryPath)
                module = GetNamespaceModule(namespacePath)
                InsertModuleAttributes(scriptGlobals, module)
The namespace creation is fairly simple, but there are of course things which can go wrong.
  • If one of our modules is inserted into sys.modules, where the importer looks for loaded modules, it will be imported when its name is given to the importer. If the module name also happens to be the name of a standard library module, that standard library module can no longer be imported and anyone who asks for it will get our module.
  • If a normal Python module is already imported and its name corresponds to one of our namespace names, and we grab it to insert our script globals into, we're now polluting that namespace. The official term for this is "module shitting".
As a wise man once said, don't cross the streams.

The best approach to take is to raise an error during the namespace creation process if one of these situations is going to occur. This is a user error, and if the user doesn't understand where their namespaces are going, they're not doing this right.

sys.modules cannot be used as an index of the created namespaces so a custom index is needed.
namespaces = {}
If a custom namespace has already been created, then it is okay to use directly.
module = namespaces.get(namespaceName, None)
if module is not None:
    return module
If the namespace path is in sys.modules but wasn't in the custom index, then to put something in there would be module shitting.
if namespaceName in sys.modules:
    raise RuntimeError("Module shitting", namespaceName)
A namespace path may be many levels deep. The parent modules should be recursively built before this child module can be created and added.
parts = namespaceName.rsplit(".", 1)
if len(parts) == 2:
    baseNamespaceName, moduleName = parts
    baseNamespace = GetNamespaceModule(baseNamespaceName)
else:
    baseNamespaceName, moduleName = None, parts[0]
    baseNamespace = None
Now either the code has errored because of module shitting, or the parent module has been obtained. The module for this namespace path can be created.
module = imp.new_module(moduleName)
Its name needs to be set.
module.__name__ = moduleName
And the file it comes from. But in our case, there is no file, as all files in a directory contribute to it. So the file name is set to a placeholder that can be recognised.
module.__file__ = "DIRECTORY("+ namespaceName +")"
The module is registered in the custom index.
namespaces[namespaceName] = module
And also in sys.modules, after which it is importable.
sys.modules[namespaceName] = module
The module needs to be accessible as an attribute of its parent module. This way, after importing 'game', you can access 'game.things'.
if baseNamespace is not None:
    setattr(baseNamespace, moduleName, module)
Which gives the function GetNamespaceModule.
def GetNamespaceModule(namespaceName):
    module = namespaces.get(namespaceName, None)
    if module is not None:
        return module

    if namespaceName in sys.modules:
        raise RuntimeError("Namespace already exists", namespaceName)

    parts = namespaceName.rsplit(".", 1)
    if len(parts) == 2:
        baseNamespaceName, moduleName = parts
        baseNamespace = GetNamespaceModule(baseNamespaceName)
    else:
        baseNamespaceName, moduleName = None, parts[0]
        baseNamespace = None

    module = imp.new_module(moduleName)
    module.__name__ = moduleName
    # Our modules don't map to files. Have a placeholder.
    module.__file__ = "DIRECTORY("+ namespaceName +")"

    namespaces[namespaceName] = module
    sys.modules[namespaceName] = module

    if baseNamespace is not None:
        setattr(baseNamespace, moduleName, module)

    return module
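For anyone wanting to experiment on a current interpreter, here is a rough Python 3 equivalent. It swaps imp.new_module for types.ModuleType, since the imp module is no longer available in recent Python versions. Unlike the function above, it gives each module its full dotted name, which is a choice made for this sketch. The snake_case names are also placeholders.

```python
import sys
import types

namespaces = {}


def get_namespace_module(namespace_name):
    """Return the custom module for namespace_name, creating it if needed."""
    module = namespaces.get(namespace_name)
    if module is not None:
        return module

    # Refuse to write into a module which something else already loaded.
    if namespace_name in sys.modules:
        raise RuntimeError("Namespace already exists", namespace_name)

    # Build the parent first, so the child has something to attach to.
    parent_name, _, short_name = namespace_name.rpartition(".")
    parent = get_namespace_module(parent_name) if parent_name else None

    module = types.ModuleType(namespace_name)
    # Our modules don't map to files. Have a placeholder.
    module.__file__ = "DIRECTORY(" + namespace_name + ")"

    namespaces[namespace_name] = module
    sys.modules[namespace_name] = module

    if parent is not None:
        setattr(parent, short_name, module)

    return module
```

Asking for 'game.things' creates both 'game' and 'game.things', while asking for the name of a module which is already imported, such as 'sys', raises the RuntimeError.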
The final step is inserting script globals into a namespace module. This is a simple matter of copying the attributes in the script globals over to the module.

The script globals dictionary automatically had a __builtins__ reference added to it when the code was executed. This should be ignored.
if k == "__builtins__":
    continue
And any classes copied over need to have their __module__ attribute set to the module name.
if type(v) is types.ClassType or type(v) is types.TypeType:
    v.__module__ = namespace.__name__
Which gives the function InsertModuleAttributes.
def InsertModuleAttributes(attributes, namespace):
    moduleName = namespace.__name__

    for k, v in attributes.iteritems():
        if k == "__builtins__":
            continue

        if type(v) is types.ClassType or type(v) is types.TypeType:
            v.__module__ = moduleName

        logging.info("InsertModuleAttribute "+ k +" "+ namespace.__file__)
        setattr(namespace, k, v)
Now after executing the following it should be possible to import from the custom namespaces.
baseNamespacePath = "game"
baseDirPath = os.path.join(sys.path[0], "scripts")
LoadDirectory(baseDirPath)

from game.things import Beta, Gamma
from game import Alpha
A good next step would be to extend this with a code reloading system, where the script directories are monitored and changes to files which contributed to the custom namespaces are applied to running code which uses them.

A runnable form of the code shown above can be found here.

The followup to this post can be found here.