Thursday, January 10, 2013

Rev 3.8 release brings initial I/O library, import clauses

We are pleased to release alpha revision 3.8 of the ParaSail compiler and virtual machine, available at the same URL:

http://bit.ly/Mx9DRb

This release includes a number of new features, including support for the import clause, and a standard library now defining modules with names like PSL::Core::Boolean and PSL::Containers::Set (rather than simply Boolean or Set). There are some new I/O modules, including PSL::Core::IO, PSL::Core::File_Input_Stream, and PSL::Core::File_Output_Stream. These last two support simple textual file I/O. We will be providing additional libraries oriented toward a graphical user interface in an upcoming release. The reference manual has also been updated to cover some of the newer features. As usual, consult the release notes for more details on the 3.8 revision.

We will write a separate blog entry about the design of the import clause -- it brought out some interesting issues. Every ParaSail module has an implicit import of PSL::Core::* and PSL::Containers::* so the introduction of the import clause shouldn't cause any compatibility issues with existing code.

8 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. == Apologies: Some of the symbols in my previous post has been stripped away by the HTML sanitizer ===

    Hi Tucker,

    I came across your paper recently on Parasail and am very interested. I working on programming language design at Berkeley and have also recently become annoyed at the number of bugs that arise due to pointers.

    I would like to know your opinion on the following design problem. Because of the move, copy, and swap semantics of Parasail, the user is limited to creating only tree-like data structures. How do you handle the use cases that are typically solved using parent pointers in other languages?

    For example, it is reasonable for a "Company" object to contain a list of "Employee"s, but shouldn't it also be reasonable to query an "Employee" and ask him what company he works for?

    As another example, we can think of "Directory"s as containing other "Directory"s and "File"s, but wouldn't it be reasonable to be able to ask a "File" for its path?

    Both the above examples are typically solved with a parent pointer in other languages. I am not fond of this solution, but have not yet found a decent alternative and am interested in Parasail's solution.

    Thank you very much
    -Patrick

    ReplyDelete
  3. The simplest solution is to adopt what is done in relational databases, and use "foreign" keys (unique id's) rather than pointers. In fact, it is often the case that with situations like "Companies" and "Employees," a given person might work for multiple companies, and clearly a company has multiple employees, so you end up representing these as separate tables, with sets of unique id's linking them together. That is, you don't store the employee directly in the company data structure, but rather you store the employee's social security #, or equivalent, and then you look that up in your person table.

    With directories and files, you can often simply retain the full name in the in-memory directory/file "handle," so that you can always retrieve it for an open file. For systems like Unix that allow a file to be in multiple directories ("hard links") then you end up storing unique IDs again ("inode numbers" in Unix parlance) rather than files inside directories.

    I thought about allowing some kind of "parent" reference, but it seemed like a relatively special case, and it added a lot of complexity. Variations on the above two approaches (using Unique IDs, or remembering the full "path" as part of the "handle" on an element of a tree) seemed more flexible, ultimately.

    ReplyDelete
  4. My experience with using the foreign key solution is that the logic does not seem so different from just using pointers. For example, I could just use a unique 64-bit integer as my foreign key for all my "Employee"s and "Company"s, and have one large table that associates keys with records. But the logic for manipulating this data structure will be as error-prone as the logic for manipulating a pointer-based structure. It will actually be more error prone, because now I have to be aware of the difference between a key and a record, I have to look up the key in the correct table, and I can no longer rely upon the garbage collector to clean up unreachable records anymore.

    Do you have an example where refactoring a pointer-based implementation of a program into a foreign key implementation of a program results in substantially cleaner code?

    Thank you very much for the quick reply.
    -Patrick

    ReplyDelete
    Replies
    1. Well using one large table is asking for trouble. Better is to use one table for each kind of entity. Then you can use strong type checking to make sure the key types are unique, and then there is no way that you can confuse a key for one table with a key for another table.

      You mention cleaning up unreachable records. That seems a bit surprising. How does a record become unreachable? If you have multiple companies, then you might have people in your database who work for multiple companies, or people who work for no companies (i.e. are unemployed). But they never become "unreachable."

      A database-like solution makes sense when the entities have some kind of independent existence. If employees only "exist" when they are part of some company, then it makes sense to make them "components" of the company, as opposed in some separate table. At which point you might want to switch to the "handle" approach where you never pass around an employee by itself, you only pass around employees along with their current employment information.

      I would be interested in scenarios where you really want a garbage collector to automatically delete long-lived objects. Garbage collectors in my experience are great for deleting short-lived objects which become unreachable, because they are passed around as parameters and then at some point are "dropped" on the floor. But for more persistent data, garbage collectors rarely seem the right approach, as there is more to be done when, say, an employee is "fired."

      As far as refactoring, the important question is what objects have independent existence, and what objects are effectively components. You would use keys and a separate table for the kinds of objects that have independent existence, and you would use containment for objects whose existence is dependent on their container. This very important distinction is often lost when pointers are the primary data structuring capability.

      Delete
  5. Those explanations are very clear. Thank you. I particularly like the distinction between the objects that have independent existence and the objects that are "components". I had a similar but less well-formed thought when I was refactoring a while ago. I classified pointers as being one of two types, one type of pointer indicated "contains", and another type of pointer indicated "linked to". Child pointers would be an example of "contains" pointers, and parent pointers would be an example of "linked to". Using this classification, I made a strict rule for myself that functions may only mutate objects that are "contained" by its arguments, i.e. that only objects reachable by following "contains" pointers from the arguments are allowed to be mutated. When this guideline is not followed, I found that the scope of the effects of a function quickly become unintuitive.

    Your explanation of the tables is very clear, and seems like the reasonable way to do things. Let me give some more details about my example to try and spot where I fell off the correct path.

    I was programming a compiler, and designed a set of AST nodes for use by the compiler. The compilers would read in text, and convert it to AST nodes, and subsequent transformations will be done entirely with the AST nodes. Because of various quirks of the semantics of the language, the AST for a program is not actually a tree but a graph.

    The first incarnation used pointers for everything. AST nodes have child pointers that point to child statements, and parent pointers that point to the containing statement. I did not follow the coding guideline that I specified in the first paragraph and the code became messy quickly.

    After this, I noticed that I conceptually find tables much clearer than objects and pointers, so I refactored the code to represent AST nodes as records in a table, with foreign keys in place of pointers. Here are the problems that I ran into:

    1) Originally, I had different tables for different types of AST nodes. But I noticed that there were quite a few statements in the language where the type would not be known. (i.e. The child could be a FunctionNode, or a VarNode, or a ClassNode!) And I was storing each of them in separate tables! In these circumstances using a foreign key was not enough, I also had to remember which table to look up the key in.

    That was not very elegant, so I eventually decided to put all of the nodes into one table, and have a field that indicates what type of record it is.

    2) Whereas in the pointer-based implementation I was manipulating objects directly, in the table based implementation I was manually responsible for keeping track of keys and records. Thus to get the name of a child statement, I would manually retrieve the child's record by looking up its key. This extra distinction between key and record led to an increase in LOC because of having to look up keys manually. I would also often forget to look up a key before accessing a field, leading to tedious bug hunting.

    3) During program transformations in the original implementation, new nodes were created and existing nodes were modified. Thus, if by following the root node (e.g. the top level FunctionNode) some AST nodes were no longer reachable, they were automatically cleaned up by the garbage collector.
    In the table based implementation, I was manually creating and assigning keys to records. Thus at the end of each transformation step, I was forced to manually do a traversal through the records in the table and clean up the ones that were no longer reachable.

    That was my experience with attempting to refactor my pointer-based code. It did not go well, but I was using a dynamically-typed language. I can understand your point about using a statically-typed language to help many of the above errors.

    I am keen to hear about which parts you would have done differently.

    Thanks very much
    -Patrick

    ReplyDelete
    Replies
    1. I have a couple of questions. What made the AST a graph? Typically those are the declaration nodes. I would recommend keeping the "symbol" table as a separate data structure, with the AST node for a declaration referring to its entry in the symbol table. Other AST nodes that need to refer to a declaration would then refer to the symbol table node, rather than the AST for the declaration.

      The second question is what did you use the parent pointers for? Parent pointers add a lot of complexity to an AST. One reason for them might be for doing symbol lookups in enclosing scopes. Again, a separate data structure for the scope stack is very helpful in my experience, with the various symbol tables connected into the scope stack at appropriate points.

      Once you get to an AST that is a true "tree," then many things become simpler. Replacing part of the tree is easy, because you know nothing is referring to it. And in ParaSail, it easily handles reclaiming storage for replaced subtrees without needing a separate garbage collector. Since it is a true tree, and there are no external pointers into the tree, when a branch is replaced, it can be immediately reclaimed, using ParaSail's region-based storage management.

      I realize I may be oversimplifying your situation, but my experience is that you don't want to intermix symbol tables with the AST. Once you make a clean separation, many things become easier to manage.

      Delete
  6. You are completely correct. I had been using the VarNode, and FunctionNode objects to store information that would traditionally have been stored in a SymbolTable. Because some ASTNodes would operate on VarNodes and FunctionNodes, the resulting AST is a graph and not a tree.

    The parent pointers did, as you predicted, add a lot of complexity to the implementation. I was using them for determining the enclosing scope of variables and functions which were used for some of the analysis stages. During program transformations, if I wasn't careful to update the parent pointers properly, the pointers would often become inconsistent.

    One thing that does bother me is that I haven't determined exactly why parent pointers add so much complexity to the code. Initially, I thought that if I simply use proper OO methodology and hide the handling of parent pointers below a simple API then the complexity would be nicely contained, but this turned out not to be the case. Do you have any thoughts/experience with this?

    I will follow your advice and try to refactor as per your suggestions. I am quite excited about how it might simplify my compiler. So thank you very much. To me it seems that these pointer issues, and data-structure design, affect software complexity much more than functional vs. imperative styles. I will be watching the development of Parasail with great interest.

    Thank you very much again
    -Patrick

    ReplyDelete